Unit 1 Module 4

Spatial Data Analysis with R (01:450:320)

1 Data analysis and visualization

We are now moving on to the final module of this unit, in which we learn about manipulating, analyzing, and visualizing data. From here on we will work more extensively with tidyverse tools.

To access the example datasets and files for this course, you must first install the auxiliary R package sdadata using the following command. Since the repository is public, you do not need to configure a Personal Access Token (PAT) for access.

devtools::install_github("SSELP/sdadata")

For this module we will be using the following packages:

library(dplyr)
library(tidyr)
library(ggplot2)
library(sdadata)

We will start making extensive use of %>% (the pipe operator). You will save yourself much time by using the keyboard shortcut for inserting it: CMD + shift + m on a Mac, ctrl + shift + m on Windows. Otherwise you will wear your fingers out typing the three characters each time (and also not get the automatic space on either side).
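As a quick illustration of what the pipe does (a toy example, not part of the course data): x %>% f(y) is shorthand for f(x, y), so the left-hand side becomes the first argument of the function on the right.

```r
library(magrittr)  # %>% is also re-exported by dplyr

# piped form reads left to right
c(1, 4, 9) %>% sqrt() %>% sum()
#> [1] 6
# the equivalent nested call
sum(sqrt(c(1, 4, 9)))
#> [1] 6
```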

2 Data preparation

The first thing we need is some data to work with. R comes with many built-in datasets, but when you are out in the wild and working on your own with R, you are generally going to get your datasets from somewhere else. That means that you will need to read them into R. Once you have read them in, you will often need to clean them up and rearrange them before they are ready to analyze. This section focuses on reading in and organizing your data in R. We are going to work with files in the commonly used csv data format.

2.1 Reading in and writing out data

We are going to work with three csv files that I downloaded from FAOStat, one of the primary sources for agricultural census data. The data represent the harvested areas and production quantities of maize, wheat, and sorghum for the years 1961-2024 for Tanzania and Zambia. They are bundled with sdadata, so you can read them in via system.file.

We can easily read in each file using base R’s read.csv function, which is widely used. However, we are going to skip straight to a more powerful csv reader provided by the tidyverse’s readr package. More powerful still is data.table::fread, which is excellent (and, as far as I know, still the fastest on the market) for reading in very large csv files, but we are sticking with the tidyverse. First, for grins, here is how you would use read.csv:

f <- system.file("extdata/FAOSTAT_maize.csv", package = "sdadata")
maize <- read.csv(f)
head(maize)
#>   Domain.Code                       Domain Area.Code..FAO.                        Area Element.Code
#> 1         QCL Crops and livestock products             215 United Republic of Tanzania         5312
#> 2         QCL Crops and livestock products             215 United Republic of Tanzania         5510
#> 3         QCL Crops and livestock products             215 United Republic of Tanzania         5312
#> 4         QCL Crops and livestock products             215 United Republic of Tanzania         5510
#> 5         QCL Crops and livestock products             215 United Republic of Tanzania         5312
#> 6         QCL Crops and livestock products             215 United Republic of Tanzania         5510
#>          Element Item.Code..CPC.         Item Year.Code Year Unit  Value Flag Flag.Description Note
#> 1 Area harvested             112 Maize (corn)      1961 1961   ha 790000    E  Estimated value     
#> 2     Production             112 Maize (corn)      1961 1961    t 590000    E  Estimated value     
#> 3 Area harvested             112 Maize (corn)      1962 1962   ha 800000    E  Estimated value     
#> 4     Production             112 Maize (corn)      1962 1962    t 600000    E  Estimated value     
#> 5 Area harvested             112 Maize (corn)      1963 1963   ha 960000    E  Estimated value     
#> 6     Production             112 Maize (corn)      1963 1963    t 850000    E  Estimated value

Now the readr way

maize <- readr::read_csv(f)
#> Rows: 256 Columns: 15
#> ── Column specification ────────────────────────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (9): Domain.Code, Domain, Area, Element, Item, Unit, Flag, Flag.Description, Note
#> dbl (6): Area.Code..FAO., Element.Code, Item.Code..CPC., Year.Code, Year, Value
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
maize
#> # A tibble: 256 × 15
#>    Domain.Code Domain     Area.Code..FAO. Area  Element.Code Element Item.Code..CPC. Item  Year.Code
#>    <chr>       <chr>                <dbl> <chr>        <dbl> <chr>             <dbl> <chr>     <dbl>
#>  1 QCL         Crops and…             215 Unit…         5312 Area h…             112 Maiz…      1961
#>  2 QCL         Crops and…             215 Unit…         5510 Produc…             112 Maiz…      1961
#>  3 QCL         Crops and…             215 Unit…         5312 Area h…             112 Maiz…      1962
#>  4 QCL         Crops and…             215 Unit…         5510 Produc…             112 Maiz…      1962
#>  5 QCL         Crops and…             215 Unit…         5312 Area h…             112 Maiz…      1963
#>  6 QCL         Crops and…             215 Unit…         5510 Produc…             112 Maiz…      1963
#>  7 QCL         Crops and…             215 Unit…         5312 Area h…             112 Maiz…      1964
#>  8 QCL         Crops and…             215 Unit…         5510 Produc…             112 Maiz…      1964
#>  9 QCL         Crops and…             215 Unit…         5312 Area h…             112 Maiz…      1965
#> 10 QCL         Crops and…             215 Unit…         5510 Produc…             112 Maiz…      1965
#> # ℹ 246 more rows
#> # ℹ 6 more variables: Year <dbl>, Unit <chr>, Value <dbl>, Flag <chr>, Flag.Description <chr>,
#> #   Note <chr>

That reads it in as a tibble, rather than a data.frame, but remember that a tibble is an enhanced data.frame.
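You can check that for yourself: a tibble’s class vector shows that it is still a data.frame underneath (a toy example).

```r
library(tibble)

tb <- tibble(x = 1:3, y = c("a", "b", "c"))
class(tb)
#> [1] "tbl_df"     "tbl"        "data.frame"
is.data.frame(tb)
#> [1] TRUE
```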

Right away we can see that the data look kind of messy. Nothing in that preview tells us much about the data, although readr at least gives a summary of the columns and their types on read-in. So let’s inspect what’s there:

library(dplyr)
maize %>% slice(1)  # same as maize[1, ]  
#> # A tibble: 1 × 15
#>   Domain.Code Domain      Area.Code..FAO. Area  Element.Code Element Item.Code..CPC. Item  Year.Code
#>   <chr>       <chr>                 <dbl> <chr>        <dbl> <chr>             <dbl> <chr>     <dbl>
#> 1 QCL         Crops and …             215 Unit…         5312 Area h…             112 Maiz…      1961
#> # ℹ 6 more variables: Year <dbl>, Unit <chr>, Value <dbl>, Flag <chr>, Flag.Description <chr>,
#> #   Note <chr>

That’s the first row of data from maize, using dplyr::slice instead of maize[1, ] to get it. We don’t need everything in there; we are actually interested in just a few columns. We’ll get to how we select those data in the next section.

First, let’s read in all three datasets:

# Chunk 1
# #1
fs <- dir(system.file("extdata/", package = "sdadata"), 
          pattern = "FAOSTAT", full.names = TRUE)
fs
#> [1] "/Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/library/sdadata/extdata//FAOSTAT_maize.csv"  
#> [2] "/Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/library/sdadata/extdata//FAOSTAT_sorghum.csv"
#> [3] "/Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/library/sdadata/extdata//FAOSTAT_wheat.csv"
# #2
crops <- lapply(fs, readr::read_csv)
#> Rows: 256 Columns: 15
#> ── Column specification ────────────────────────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (9): Domain.Code, Domain, Area, Element, Item, Unit, Flag, Flag.Description, Note
#> dbl (6): Area.Code..FAO., Element.Code, Item.Code..CPC., Year.Code, Year, Value
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Rows: 256 Columns: 15
#> ── Column specification ────────────────────────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (9): Domain.Code, Domain, Area, Element, Item, Unit, Flag, Flag.Description, Note
#> dbl (6): Area.Code..FAO., Element.Code, Item.Code..CPC., Year.Code, Year, Value
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Rows: 256 Columns: 15
#> ── Column specification ────────────────────────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (9): Domain.Code, Domain, Area, Element, Item, Unit, Flag, Flag.Description, Note
#> dbl (6): Area.Code..FAO., Element.Code, Item.Code..CPC., Year.Code, Year, Value
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

We used good old lapply to read all three files into a list (#2), after we used the dir function with system.file to retrieve the paths to the three csvs (#1). Let’s break that down:

# Chunk 2
# #1
system.file("extdata", package = "sdadata")
#> [1] "/Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/library/sdadata/extdata"
#
# #2
dir(system.file("extdata", package = "sdadata"))
#> [1] "bioclim.tif"         "districts.geojson"   "FAOSTAT_maize.csv"   "FAOSTAT_sorghum.csv"
#> [5] "FAOSTAT_wheat.csv"   "roads.geojson"       "species.csv"
#
# #3
dir(system.file("extdata", package = "sdadata"), pattern = "FAOSTAT")
#> [1] "FAOSTAT_maize.csv"   "FAOSTAT_sorghum.csv" "FAOSTAT_wheat.csv"
#
# #4
dir(system.file("extdata", package = "sdadata"), pattern = "FAOSTAT", 
    full.names = TRUE)
#> [1] "/Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/library/sdadata/extdata/FAOSTAT_maize.csv"  
#> [2] "/Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/library/sdadata/extdata/FAOSTAT_sorghum.csv"
#> [3] "/Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/library/sdadata/extdata/FAOSTAT_wheat.csv"

In #1, we get the file path to the extdata folder of the installed sdadata package. #2 shows us the names of all the files in there. #3 narrows the listed files down to those whose names contain “FAOSTAT”, and #4 returns the full file path for each.

So that’s a quick introduction to how one can construct a file path and read in table data stored in a csv.

Let’s say we want to write out some data:

# Chunk 3
# #1
set.seed(1)
dat <- tibble(a = sample(1:10), b = rnorm(10))
# #2
td <- tempdir()
# #3
readr::write_csv(dat, file = file.path(td, "dummy.csv"))
# #4
readr::read_csv(file.path(td, "dummy.csv"))
#> Rows: 10 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────────────────────────
#> Delimiter: ","
#> dbl (2): a, b
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 10 × 2
#>        a      b
#>    <dbl>  <dbl>
#>  1     9 -0.820
#>  2     4  0.487
#>  3     7  0.738
#>  4     1  0.576
#>  5     2 -0.305
#>  6     5  1.51 
#>  7     3  0.390
#>  8    10 -0.621
#>  9     6 -2.21 
#> 10     8  1.12

In #1 we do the usual creation of dummy data, but we replace data.frame with tibble, the enhanced data.frame (note that we could have used data.frame). #2 creates a temporary directory, which lets this demonstration run on your computer without you having to create a directory whose name and path match one on mine. #3 uses readr::write_csv to write the data to disk, and #4 reads it back in.

2.2 Getting your data clean and in shape

As we have already seen, our three crop datasets are messy. There are columns we don’t need, and we are not sure if we want the row structure as it is. So we have to prepare our data.

2.2.1 Tidy data

This introduces the concept of tidy data, which is the foundational concept of the tidyverse. Hadley Wickham wrote a whole paper on the tidy data concept, published in the Journal of Statistical Software. To quote the key principles:

Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types. In tidy data:

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.

…Messy data is any other arrangement of the data.

Please do read that paper to get a full understanding of tidy data, as we are going to be reorganizing our data according to these principles.

2.2.2 Selecting

Let’s start by getting rid of some of the extraneous variables in our data. We’ll start with just the maize dataset, which we read in on its own above. Having already looked at it, we know there are a bunch of columns we don’t need, so we will pull out the essentials:

# Chunk 4
# dplyr selection
maize <- maize %>% dplyr::select(Item, Area, Element, Year, Value)
# base R (not run)
# maize <- maize[, c("Item", "Area", "Element", "Year", "Value")]

This reduces maize down to the columns Item (crop name), Area (country), Element (the variable), Year, and Value (the value of the variable). Note that we used dplyr::select (the :: specification is needed here because of a namespace conflict with raster::select) to grab the columns we wanted, instead of the base R way of selecting (shown commented out).
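dplyr also provides selection helpers, such as starts_with, ends_with, and contains, that match columns by name pattern rather than by listing them, and a minus sign drops columns. A toy sketch (not run on the course data):

```r
library(dplyr)

tb <- tibble::tibble(Year.Code = 1961, Year = 1961, Value = 5, Flag = "E")
# keep all columns whose names start with "Year"
tb %>% select(starts_with("Year"))
# drop a column by prefixing its name with a minus sign
tb %>% select(-Flag)
```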

Year and Value store numeric values, but Item, Area, and Element are categorical (character). So let’s look at what values are stored in them:

# Chunk 5
maize %>% distinct(Item, Area, Element)
#> # A tibble: 4 × 3
#>   Item         Area                        Element       
#>   <chr>        <chr>                       <chr>         
#> 1 Maize (corn) United Republic of Tanzania Area harvested
#> 2 Maize (corn) United Republic of Tanzania Production    
#> 3 Maize (corn) Zambia                      Area harvested
#> 4 Maize (corn) Zambia                      Production
# unique(maize[c("Item", "Area", "Element")])

We use the distinct function to pull out the unique combinations of values in those columns. You could do this in base R as well, as shown in the commented-out code.

This exercise tells us something. We have one violation of the tidy principles. Item and Area seem correct: each stores a single variable (crop name and country name). Element, on the other hand, contains:

Multiple variables are stored in one column

Which is one of the definitions of messy data.

2.2.3 Reshaping

So we need to reshape the dataset. How? Well, the values “Area harvested” and “Production” in Element are both stored in the neighboring Value column. We need to make two new columns out of values in Value, pulling out the ones corresponding to Element “Production” into one new column, and Element “Area harvested” into another. This is called spreading, or going from long (or tall) to wide:

# Chunk 6
maize <- maize %>% pivot_wider(names_from = Element, values_from = Value)
# maize <- maize %>% spread(key = Element, value = Value)
maize
#> # A tibble: 128 × 5
#>    Item         Area                         Year `Area harvested` Production
#>    <chr>        <chr>                       <dbl>            <dbl>      <dbl>
#>  1 Maize (corn) United Republic of Tanzania  1961           790000     590000
#>  2 Maize (corn) United Republic of Tanzania  1962           800000     600000
#>  3 Maize (corn) United Republic of Tanzania  1963           960000     850000
#>  4 Maize (corn) United Republic of Tanzania  1964           930000     720000
#>  5 Maize (corn) United Republic of Tanzania  1965           950000     751000
#>  6 Maize (corn) United Republic of Tanzania  1966          1100000     880000
#>  7 Maize (corn) United Republic of Tanzania  1967          1000000     750000
#>  8 Maize (corn) United Republic of Tanzania  1968          1014000     551000
#>  9 Maize (corn) United Republic of Tanzania  1969          1014000     638000
#> 10 Maize (corn) United Republic of Tanzania  1970          1015000     488000
#> # ℹ 118 more rows

We use tidyr::pivot_wider to do that, specifying the Element column as the one that contains the names of the new columns (“names_from”), and Value as the column from which we take the values to fill under each column (“values_from”). Note that this function replaces the commented-out one below it, which is the original tidyr::spread function. pivot_wider is considered more intuitive. The inverse procedure is known as gathering, where columns are re-organized into key-value pairs:

# Chunk 7
maize %>% pivot_longer(cols = c("Area harvested", "Production"), 
                       names_to = "Element", values_to = "Value")
# maize %>% gather(key = Element, value = Value, `Area harvested`, Production)
#> # A tibble: 256 × 5
#>    Item         Area                         Year Element         Value
#>    <chr>        <chr>                       <dbl> <chr>           <dbl>
#>  1 Maize (corn) United Republic of Tanzania  1961 Area harvested 790000
#>  2 Maize (corn) United Republic of Tanzania  1961 Production     590000
#>  3 Maize (corn) United Republic of Tanzania  1962 Area harvested 800000
#>  4 Maize (corn) United Republic of Tanzania  1962 Production     600000
#>  5 Maize (corn) United Republic of Tanzania  1963 Area harvested 960000
#>  6 Maize (corn) United Republic of Tanzania  1963 Production     850000
#>  7 Maize (corn) United Republic of Tanzania  1964 Area harvested 930000
#>  8 Maize (corn) United Republic of Tanzania  1964 Production     720000
#>  9 Maize (corn) United Republic of Tanzania  1965 Area harvested 950000
#> 10 Maize (corn) United Republic of Tanzania  1965 Production     751000
#> # ℹ 246 more rows

Here we use tidyr::pivot_longer (which replaced tidyr::gather) to reshape maize back to its original form. It takes three arguments: “cols”, which designates the columns you want to combine into long form; “names_to”, which names the column that will hold the key variable; and “values_to”, which names the column that will hold the values. The commented-out line shows how to do the same thing with the older gather function. Gathering, also known as going from wide to long (or tall), is probably more common than spreading, but for our datasets we have to spread, because the original form keeps two clearly distinct variables in one column. So we will keep the reshaped maize.

2.2.4 Renaming

We still need to clean it up some more, though. Note in the pivot_longer operation above how Area harvested has backticks around it. That’s because the variable name contains a space, which is bad practice. We should also remove the capital letters.

# Chunk 8
maize %>% rename(crop = Item, country = Area, year = Year, 
                 harv_area = `Area harvested`, prod = Production)
#> # A tibble: 128 × 5
#>    crop         country                      year harv_area   prod
#>    <chr>        <chr>                       <dbl>     <dbl>  <dbl>
#>  1 Maize (corn) United Republic of Tanzania  1961    790000 590000
#>  2 Maize (corn) United Republic of Tanzania  1962    800000 600000
#>  3 Maize (corn) United Republic of Tanzania  1963    960000 850000
#>  4 Maize (corn) United Republic of Tanzania  1964    930000 720000
#>  5 Maize (corn) United Republic of Tanzania  1965    950000 751000
#>  6 Maize (corn) United Republic of Tanzania  1966   1100000 880000
#>  7 Maize (corn) United Republic of Tanzania  1967   1000000 750000
#>  8 Maize (corn) United Republic of Tanzania  1968   1014000 551000
#>  9 Maize (corn) United Republic of Tanzania  1969   1014000 638000
#> 10 Maize (corn) United Republic of Tanzania  1970   1015000 488000
#> # ℹ 118 more rows

So that’s fairly straightforward: we use dplyr::rename to assign each column a new, more informative name. That’s the most direct way of renaming. There are more programmatic ways of doing it, but we will circle back to those later on.
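As a brief preview of the programmatic route (this assumes a dplyr version recent enough to provide rename_with; the renaming function shown is just one possibility), you can apply a function to every column name at once:

```r
library(dplyr)

# a toy tibble mirroring maize's column names
tb <- tibble::tibble(Item = "Maize", `Area harvested` = 790000, 
                     Production = 590000)
# lowercase every name and replace spaces with underscores,
# so `Area harvested` becomes area_harvested
tb %>% rename_with(~ gsub(" ", "_", tolower(.x)))
```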

2.2.5 Chaining/piping

Okay, so we have just seen a bunch of data tidying steps above. This is a good point to talk about combining operations. It is advantageous to combine operations when we have to do several things to get a desired outcome, but have no need to keep the results of intermediate steps, as in this case–we have to make several fixes to these data, but only care about their final tidied version. We can combine our tidying operations so that they are executed all at once using dplyr/tidyr pipes:

# Chunk 9
crops[[1]] %>% dplyr::select(Item, Area, Element, Year, Value) %>% 
  pivot_wider(names_from = Element, values_from = Value) %>% 
  rename(crop = Item, country = Area, year = Year, 
         harv_area = `Area harvested`, prod = Production)
#> # A tibble: 128 × 5
#>    crop         country                      year harv_area   prod
#>    <chr>        <chr>                       <dbl>     <dbl>  <dbl>
#>  1 Maize (corn) United Republic of Tanzania  1961    790000 590000
#>  2 Maize (corn) United Republic of Tanzania  1962    800000 600000
#>  3 Maize (corn) United Republic of Tanzania  1963    960000 850000
#>  4 Maize (corn) United Republic of Tanzania  1964    930000 720000
#>  5 Maize (corn) United Republic of Tanzania  1965    950000 751000
#>  6 Maize (corn) United Republic of Tanzania  1966   1100000 880000
#>  7 Maize (corn) United Republic of Tanzania  1967   1000000 750000
#>  8 Maize (corn) United Republic of Tanzania  1968   1014000 551000
#>  9 Maize (corn) United Republic of Tanzania  1969   1014000 638000
#> 10 Maize (corn) United Republic of Tanzania  1970   1015000 488000
#> # ℹ 118 more rows

We grab the first element of crops, the maize tibble, and apply all three operations sequentially by chaining them together with %>%, the pipe operator. This means we are piping the results of each operation to the next: the results from select are piped to pivot_wider, and then the results of those two to rename.

Important note: chaining/piping is not something unique to data tidying, as it applies to operations throughout the tidyverse.
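A related aside: base R itself (version 4.1 onward) has a native pipe, |>, which behaves the same way for simple calls like these (a toy example):

```r
library(dplyr)

tb <- tibble::tibble(a = 1:3)
tb %>% slice(1)  # magrittr pipe, used throughout this course
tb |> slice(1)   # base R native pipe (R >= 4.1), equivalent here
```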

Now that we have our pipes set up, we can apply them to all three datasets in an lapply:

# Chunk 10
lapply(crops, function(x) {
  x %>% dplyr::select(Item, Area, Element, Year, Value) %>% 
    pivot_wider(names_from = Element, values_from = Value) %>% 
    rename(crop = Item, country = Area, year = Year, 
           harv_area = `Area harvested`, prod = Production)
})
#> [[1]]
#> # A tibble: 128 × 5
#>    crop         country                      year harv_area   prod
#>    <chr>        <chr>                       <dbl>     <dbl>  <dbl>
#>  1 Maize (corn) United Republic of Tanzania  1961    790000 590000
#>  2 Maize (corn) United Republic of Tanzania  1962    800000 600000
#>  3 Maize (corn) United Republic of Tanzania  1963    960000 850000
#>  4 Maize (corn) United Republic of Tanzania  1964    930000 720000
#>  5 Maize (corn) United Republic of Tanzania  1965    950000 751000
#>  6 Maize (corn) United Republic of Tanzania  1966   1100000 880000
#>  7 Maize (corn) United Republic of Tanzania  1967   1000000 750000
#>  8 Maize (corn) United Republic of Tanzania  1968   1014000 551000
#>  9 Maize (corn) United Republic of Tanzania  1969   1014000 638000
#> 10 Maize (corn) United Republic of Tanzania  1970   1015000 488000
#> # ℹ 118 more rows
#> 
#> [[2]]
#> # A tibble: 128 × 5
#>    crop    country                      year harv_area   prod
#>    <chr>   <chr>                       <dbl>     <dbl>  <dbl>
#>  1 Sorghum United Republic of Tanzania  1961    200000 180000
#>  2 Sorghum United Republic of Tanzania  1962    175000 175000
#>  3 Sorghum United Republic of Tanzania  1963    162000 178000
#>  4 Sorghum United Republic of Tanzania  1964    173000 135258
#>  5 Sorghum United Republic of Tanzania  1965    192000 149124
#>  6 Sorghum United Republic of Tanzania  1966    310000 161072
#>  7 Sorghum United Republic of Tanzania  1967    310000 143627
#>  8 Sorghum United Republic of Tanzania  1968    315000 163314
#>  9 Sorghum United Republic of Tanzania  1969    310000 145179
#> 10 Sorghum United Republic of Tanzania  1970    310000 171912
#> # ℹ 118 more rows
#> 
#> [[3]]
#> # A tibble: 128 × 5
#>    crop  country                      year harv_area  prod
#>    <chr> <chr>                       <dbl>     <dbl> <dbl>
#>  1 Wheat United Republic of Tanzania  1961      8000  6100
#>  2 Wheat United Republic of Tanzania  1962     18000 17578
#>  3 Wheat United Republic of Tanzania  1963     22000 24514
#>  4 Wheat United Republic of Tanzania  1964     25000 27099
#>  5 Wheat United Republic of Tanzania  1965     33000 33019
#>  6 Wheat United Republic of Tanzania  1966     34000 38270
#>  7 Wheat United Republic of Tanzania  1967     31000 35099
#>  8 Wheat United Republic of Tanzania  1968     38000 44000
#>  9 Wheat United Republic of Tanzania  1969     38000 41000
#> 10 Wheat United Republic of Tanzania  1970     60000 57000
#> # ℹ 118 more rows

The tidying is now done, although we haven’t saved the results to a new object yet. We’ll do that next.

2.2.6 Combining datasets (and operations)

Oftentimes when one is working with data, there is a need to combine several datasets into one. These three datasets are one such example–we don’t really need to keep our three tibbles as separate elements in a list, as it might be easier to perform any further manipulations of the data if we combine them all into one big tibble. We’ll look at two ways of joining tables.

2.2.6.1 Binding

Now we’ll bind all three tibbles into a single large one:

# Chunk 11
crops_df <- do.call(rbind, lapply(crops, function(x) {
  x %>% dplyr::select(Item, Area, Element, Year, Value) %>% 
    pivot_wider(names_from = Element, values_from = Value) %>% 
    rename(crop = Item, country = Area, year = Year, 
           harv_area = `Area harvested`, prod = Production)
}))
crops_df
#> # A tibble: 384 × 5
#>    crop         country                      year harv_area   prod
#>    <chr>        <chr>                       <dbl>     <dbl>  <dbl>
#>  1 Maize (corn) United Republic of Tanzania  1961    790000 590000
#>  2 Maize (corn) United Republic of Tanzania  1962    800000 600000
#>  3 Maize (corn) United Republic of Tanzania  1963    960000 850000
#>  4 Maize (corn) United Republic of Tanzania  1964    930000 720000
#>  5 Maize (corn) United Republic of Tanzania  1965    950000 751000
#>  6 Maize (corn) United Republic of Tanzania  1966   1100000 880000
#>  7 Maize (corn) United Republic of Tanzania  1967   1000000 750000
#>  8 Maize (corn) United Republic of Tanzania  1968   1014000 551000
#>  9 Maize (corn) United Republic of Tanzania  1969   1014000 638000
#> 10 Maize (corn) United Republic of Tanzania  1970   1015000 488000
#> # ℹ 374 more rows

Building on Chunk 10, we apply the tidying procedure and then join the three tidied tibbles into one long one (crops_df), using do.call and rbind for that last step. do.call basically says “do this function call”, the call here being rbind, which is applied to the list of tibbles returned by the lapply. Or, broken down:

# Chunk 12
crops2 <- lapply(crops, function(x) {
  x %>% dplyr::select(Item, Area, Element, Year, Value) %>% 
    pivot_wider(names_from = Element, values_from = Value) %>% 
    rename(crop = Item, country = Area, year = Year, 
           harv_area = `Area harvested`, prod = Production)
})
crops_df <- do.call(rbind, crops2)
# crops_df <- rbind(crops2[[1]], crops2[[2]], crops2[[3]])
set.seed(1)
crops_df %>% sample_n(., size = 20)
#> # A tibble: 20 × 5
#>    crop         country                      year harv_area    prod
#>    <chr>        <chr>                       <dbl>     <dbl>   <dbl>
#>  1 Wheat        Zambia                       1964       145     214
#>  2 Sorghum      United Republic of Tanzania  1999    659868  561031
#>  3 Sorghum      United Republic of Tanzania  1961    200000  180000
#>  4 Wheat        United Republic of Tanzania  2003     26890   74000
#>  5 Wheat        United Republic of Tanzania  1974     46000   82000
#>  6 Sorghum      United Republic of Tanzania  2019    646868  731877
#>  7 Wheat        United Republic of Tanzania  2011    108287  112658
#>  8 Maize (corn) Zambia                       1981    493783 1007280
#>  9 Wheat        United Republic of Tanzania  1981     51000   95000
#> 10 Wheat        Zambia                       2002     12000   75000
#> 11 Wheat        Zambia                       1970       100     101
#> 12 Wheat        United Republic of Tanzania  1967     31000   35099
#> 13 Wheat        Zambia                       1969       100     220
#> 14 Maize (corn) Zambia                       1975   1020000 1483133
#> 15 Sorghum      Zambia                       1981     21500   13950
#> 16 Maize (corn) United Republic of Tanzania  1997   1564000 1831200
#> 17 Maize (corn) Zambia                       2001    582000  802000
#> 18 Sorghum      Zambia                       1985     24811   20227
#> 19 Wheat        Zambia                       2006     17100   93958
#> 20 Sorghum      United Republic of Tanzania  1997    622400  538200

The above shows the lapply and do.call(rbind, ...) steps separately. Note the commented-out line, which shows an alternate, less programmatic way of binding the three tables. The final line uses dplyr::sample_n to select 20 rows at random, which reveals that the new, larger tibble contains observations of all three crop types.

Another note: there is actually a pure tidyverse way of doing the full set of steps, using purrr’s set of map* functions to replace the lapply, but we will leave that aside for now.
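For the curious, here is a sketch of that pure-tidyverse version (assuming purrr is installed): purrr::map takes the place of lapply, and dplyr::bind_rows takes the place of do.call(rbind, ...).

```r
library(dplyr)
library(tidyr)

# same tidying pipeline as Chunk 11, expressed with purrr and bind_rows
# (crops is the list of three tibbles read in earlier)
crops_df <- purrr::map(crops, function(x) {
  x %>% dplyr::select(Item, Area, Element, Year, Value) %>% 
    pivot_wider(names_from = Element, values_from = Value) %>% 
    rename(crop = Item, country = Area, year = Year, 
           harv_area = `Area harvested`, prod = Production)
}) %>% bind_rows()
```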

2.2.6.2 Merging/joining

So we just saw how to combine datasets by binding them by rows. Oftentimes you want to combine them by adding a variable or variables from one dataset to another, using one or more common variables as the key(s) for joining. In base R, this is done with the merge function, but with dplyr we use the *_join functions.

We’ll start with some dummy datasets to illustrate, and then use our own crop data to show a more complicated join using two variables.

# Chunk 13a
set.seed(1)
t1 <- tibble(v1 = paste0("N", 1:5), v2 = rnorm(5))
t1
#> # A tibble: 5 × 2
#>   v1        v2
#>   <chr>  <dbl>
#> 1 N1    -0.626
#> 2 N2     0.184
#> 3 N3    -0.836
#> 4 N4     1.60 
#> 5 N5     0.330
t2 <- tibble(v1 = paste0("N", 1:5), v3 = runif(5))
t2
#> # A tibble: 5 × 2
#>   v1       v3
#>   <chr> <dbl>
#> 1 N1    0.206
#> 2 N2    0.177
#> 3 N3    0.687
#> 4 N4    0.384
#> 5 N5    0.770
t3 <- tibble(v1 = paste0("N", 1:7), v4 = sample(1:100, 7))
t3
#> # A tibble: 7 × 2
#>   v1       v4
#>   <chr> <int>
#> 1 N1       54
#> 2 N2       74
#> 3 N3        7
#> 4 N4       73
#> 5 N5       79
#> 6 N6       85
#> 7 N7       37
t4 <- tibble(v1 = paste0("N", c(1:2, 4:7, 11)), 
             v5 = letters[sample(1:26, 7)])
t4
#> # A tibble: 7 × 2
#>   v1    v5   
#>   <chr> <chr>
#> 1 N1    i    
#> 2 N2    y    
#> 3 N4    n    
#> 4 N5    e    
#> 5 N6    w    
#> 6 N7    b    
#> 7 N11   j

We make four tibbles, each sharing the key variable v1 but each carrying its own unique variable (v2, v3, v4, and v5). Note that t3 and t4 have seven rows rather than five, and that t4’s v1 values skip N3 and include N11.

There are several different ways to do joins, with a matching function for each. Please read about those functions (e.g. ?dplyr::left_join).
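Besides the mutating joins demonstrated below, dplyr also offers filtering joins: semi_join keeps the rows of x that have a match in y, and anti_join keeps the rows that do not, without adding any columns from y. A sketch using t1 and t4 (defined above):

```r
library(dplyr)

# rows of t1 whose v1 appears in t4 (N1, N2, N4, N5)
semi_join(x = t1, y = t4, by = "v1")
# rows of t1 with no match in t4 (just N3)
anti_join(x = t1, y = t4, by = "v1")
```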

The simplest join case is between t1 and t2, which can be done as:

# Chunk 13b
# #1
left_join(x = t1, y = t2, by = "v1")
#> # A tibble: 5 × 3
#>   v1        v2    v3
#>   <chr>  <dbl> <dbl>
#> 1 N1    -0.626 0.206
#> 2 N2     0.184 0.177
#> 3 N3    -0.836 0.687
#> 4 N4     1.60  0.384
#> 5 N5     0.330 0.770
#
# #2 
inner_join(x = t1, y = t2, by = "v1")
#> # A tibble: 5 × 3
#>   v1        v2    v3
#>   <chr>  <dbl> <dbl>
#> 1 N1    -0.626 0.206
#> 2 N2     0.184 0.177
#> 3 N3    -0.836 0.687
#> 4 N4     1.60  0.384
#> 5 N5     0.330 0.770
#
# #3 
right_join(x = t1, y = t2, by = "v1")
#> # A tibble: 5 × 3
#>   v1        v2    v3
#>   <chr>  <dbl> <dbl>
#> 1 N1    -0.626 0.206
#> 2 N2     0.184 0.177
#> 3 N3    -0.836 0.687
#> 4 N4     1.60  0.384
#> 5 N5     0.330 0.770
# 
# #4
full_join(x = t1, y = t2, by = "v1")
#> # A tibble: 5 × 3
#>   v1        v2    v3
#>   <chr>  <dbl> <dbl>
#> 1 N1    -0.626 0.206
#> 2 N2     0.184 0.177
#> 3 N3    -0.836 0.687
#> 4 N4     1.60  0.384
#> 5 N5     0.330 0.770

We use inner_join, left_join, right_join, or full_join to merge t1 and t2, getting identical results because the two tibbles have the same number of rows and a unique key value per row on the join variable (v1). The results begin to diverge when we have differing numbers of rows:

# Chunk 14
# #1
inner_join(x = t1, y = t3, by = "v1")
#> # A tibble: 5 × 3
#>   v1        v2    v4
#>   <chr>  <dbl> <int>
#> 1 N1    -0.626    54
#> 2 N2     0.184    74
#> 3 N3    -0.836     7
#> 4 N4     1.60     73
#> 5 N5     0.330    79
#
# #2
left_join(x = t1, y = t3, by = "v1")
#> # A tibble: 5 × 3
#>   v1        v2    v4
#>   <chr>  <dbl> <int>
#> 1 N1    -0.626    54
#> 2 N2     0.184    74
#> 3 N3    -0.836     7
#> 4 N4     1.60     73
#> 5 N5     0.330    79
#
# #3
right_join(x = t1, y = t3, by = "v1")
#> # A tibble: 7 × 3
#>   v1        v2    v4
#>   <chr>  <dbl> <int>
#> 1 N1    -0.626    54
#> 2 N2     0.184    74
#> 3 N3    -0.836     7
#> 4 N4     1.60     73
#> 5 N5     0.330    79
#> 6 N6    NA        85
#> 7 N7    NA        37
# 
# #4 
full_join(x = t1, y = t3, by = "v1")
#> # A tibble: 7 × 3
#>   v1        v2    v4
#>   <chr>  <dbl> <int>
#> 1 N1    -0.626    54
#> 2 N2     0.184    74
#> 3 N3    -0.836     7
#> 4 N4     1.60     73
#> 5 N5     0.330    79
#> 6 N6    NA        85
#> 7 N7    NA        37

In #1 and #2, joining t1 and t3 produces identical results with either inner_join or left_join, even though the two functions do slightly different things (note: the quotes below are from older documentation):

  • ?inner_join:

    …return all rows from x where there are matching values in y, and all columns from x and y.

  • ?left_join:

    …return all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns.

Since t1 is the x object, and t3 is the y, both join functions drop the two extra rows in t3 because they have no matches in t1.

That contrasts with #3, where we use a right_join, which returns (from ?right_join):

all rows from y, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns.

So the unmatched rows from t3 (the last two) in the joined result are filled with NAs in the v2 column that came in from t1. full_join produces the same result, because it returns (from ?full_join):

return all rows and all columns from both x and y. Where there are not matching values, returns NA for the one missing.

You get the same results in the last two because the values in t1’s v1 column are a subset of t3’s v1 values. Let’s see what happens when each join column has values not shared by the other:

# Chunk 15a
# #1
inner_join(x = t3, y = t4, by = "v1")
#> # A tibble: 6 × 3
#>   v1       v4 v5   
#>   <chr> <int> <chr>
#> 1 N1       54 i    
#> 2 N2       74 y    
#> 3 N4       73 n    
#> 4 N5       79 e    
#> 5 N6       85 w    
#> 6 N7       37 b
#
# #2
left_join(x = t3, y = t4, by = "v1")
#> # A tibble: 7 × 3
#>   v1       v4 v5   
#>   <chr> <int> <chr>
#> 1 N1       54 i    
#> 2 N2       74 y    
#> 3 N3        7 <NA> 
#> 4 N4       73 n    
#> 5 N5       79 e    
#> 6 N6       85 w    
#> 7 N7       37 b
#
# #3
right_join(x = t3, y = t4, by = "v1")
#> # A tibble: 7 × 3
#>   v1       v4 v5   
#>   <chr> <int> <chr>
#> 1 N1       54 i    
#> 2 N2       74 y    
#> 3 N4       73 n    
#> 4 N5       79 e    
#> 5 N6       85 w    
#> 6 N7       37 b    
#> 7 N11      NA j
# 
# #4 
full_join(x = t3, y = t4, by = "v1")
#> # A tibble: 8 × 3
#>   v1       v4 v5   
#>   <chr> <int> <chr>
#> 1 N1       54 i    
#> 2 N2       74 y    
#> 3 N3        7 <NA> 
#> 4 N4       73 n    
#> 5 N5       79 e    
#> 6 N6       85 w    
#> 7 N7       37 b    
#> 8 N11      NA j

Each of the four results produces a different output (and it is left to you to describe why in the practice questions below).
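Beyond these four mutate-joins, dplyr also provides filtering joins, semi_join and anti_join, which select rows of x based on matches in y without adding any columns from y. A minimal sketch, reusing tibbles shaped like t1 and t3 (the random values differ from the chunks above):

```r
library(dplyr)

t1 <- tibble(v1 = paste0("N", 1:5), v2 = rnorm(5))
t3 <- tibble(v1 = paste0("N", 1:7), v4 = sample(1:100, 7))

# semi_join keeps the rows of x that have a match in y, adding no columns from y
matched <- semi_join(t3, t1, by = "v1")
matched    # 5 rows (N1-N5), columns v1 and v4 only

# anti_join keeps the rows of x that have no match in y
unmatched <- anti_join(t3, t1, by = "v1")
unmatched  # 2 rows: N6 and N7
```

anti_join in particular is handy for spotting key values that fail to match before you commit to a merge.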

Right, so that is an introduction to some different types of joins, and their varying behavior. We can join all four together using pipes:

# Chunk 15b
full_join(t1, t2) %>% full_join(., t3) %>% full_join(., t4)
#> Joining with `by = join_by(v1)`
#> Joining with `by = join_by(v1)`
#> Joining with `by = join_by(v1)`
#> # A tibble: 8 × 5
#>   v1        v2     v3    v4 v5   
#>   <chr>  <dbl>  <dbl> <int> <chr>
#> 1 N1    -0.626  0.206    54 i    
#> 2 N2     0.184  0.177    74 y    
#> 3 N3    -0.836  0.687     7 <NA> 
#> 4 N4     1.60   0.384    73 n    
#> 5 N5     0.330  0.770    79 e    
#> 6 N6    NA     NA        85 w    
#> 7 N7    NA     NA        37 b    
#> 8 N11   NA     NA        NA j

That is the most inclusive join of all four tibbles. Notice that we dropped the argument names, and the join variable is automatically found.
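As an aside, a chain of full_joins like that can also be folded over a list of tibbles with base R’s Reduce (or purrr’s reduce); a sketch with small stand-in tibbles:

```r
library(dplyr)

t1 <- tibble(v1 = paste0("N", 1:3), v2 = 1:3)
t2 <- tibble(v1 = paste0("N", 1:3), v3 = 4:6)
t3 <- tibble(v1 = paste0("N", 1:4), v4 = 7:10)

# fold full_join over the list: equivalent to full_join(full_join(t1, t2), t3)
combined <- Reduce(function(x, y) full_join(x, y, by = "v1"), list(t1, t2, t3))
combined  # 4 rows; N4 gets NA for v2 and v3
```

This scales better than a pipe when you have many tibbles, or a list of them of unknown length.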

We are now going to end by doing a more complicated join based on three columns, using our crop data:

# Chunk 16
# #1
yields <- crops_df %>% 
  mutate(yield = prod / harv_area) %>% 
  dplyr::select(crop, country, year, yield)
yields %>% slice(1:2)
#> # A tibble: 2 × 4
#>   crop         country                      year yield
#>   <chr>        <chr>                       <dbl> <dbl>
#> 1 Maize (corn) United Republic of Tanzania  1961 0.747
#> 2 Maize (corn) United Republic of Tanzania  1962 0.75
# 
# #2
crop_ylds <- left_join(x = crops_df, y = yields, 
                       by = c("crop", "country", "year"))
#
# 3
crop_ylds <- left_join(crops_df, yields) 
#> Joining with `by = join_by(crop, country, year)`
crop_ylds
#> # A tibble: 384 × 6
#>    crop         country                      year harv_area   prod yield
#>    <chr>        <chr>                       <dbl>     <dbl>  <dbl> <dbl>
#>  1 Maize (corn) United Republic of Tanzania  1961    790000 590000 0.747
#>  2 Maize (corn) United Republic of Tanzania  1962    800000 600000 0.75 
#>  3 Maize (corn) United Republic of Tanzania  1963    960000 850000 0.885
#>  4 Maize (corn) United Republic of Tanzania  1964    930000 720000 0.774
#>  5 Maize (corn) United Republic of Tanzania  1965    950000 751000 0.791
#>  6 Maize (corn) United Republic of Tanzania  1966   1100000 880000 0.8  
#>  7 Maize (corn) United Republic of Tanzania  1967   1000000 750000 0.75 
#>  8 Maize (corn) United Republic of Tanzania  1968   1014000 551000 0.543
#>  9 Maize (corn) United Republic of Tanzania  1969   1014000 638000 0.629
#> 10 Maize (corn) United Republic of Tanzania  1970   1015000 488000 0.481
#> # ℹ 374 more rows

First, we make a new tibble that calculates crop yield from production and harvested area (yield = production / harvest area). We use the mutate function (more on that in section 2.2.8) to create the new variable, and select just crop, country, year, and yield to make a new yields tibble.

We then use a left_join to merge yields onto crops_df. We specify three columns in this case, because no single variable provides a unique key, nor do any two variables. However, the values in crop, country, and year together do provide a unique combination for each row, so we join on those. #2 makes that explicit, but we see in #3 that left_join finds the join columns for us automatically if we don’t specify them.
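Note that newer versions of dplyr (1.1.0 and later) also accept a join_by() specification in place of the character vector of column names; a small sketch with hypothetical stand-in tibbles df_a and df_b:

```r
library(dplyr)

df_a <- tibble(crop = c("maize", "wheat"), year = c(2000, 2001), prod = c(10, 20))
df_b <- tibble(crop = c("maize", "wheat"), year = c(2000, 2001), area = c(5, 4))

# join_by() spells out the multi-column key (dplyr >= 1.1.0)
joined <- left_join(df_a, df_b, by = join_by(crop, year))
joined  # 2 rows, with prod and area side by side
```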

2.2.7 Arranging

That last line of Chunk 12 (crops_df %>% sample_n(., size = 20)) gives an opportunity to introduce another aspect of data organizing, which is sorting. Note that the years are out of order in that final random draw. We can rearrange them so that they are sequential:

# Chunk 17
set.seed(1)
# 
# #1
crops_df %>% 
  sample_n(., size = 20) %>% 
  arrange(year)
#> # A tibble: 20 × 5
#>    crop         country                      year harv_area    prod
#>    <chr>        <chr>                       <dbl>     <dbl>   <dbl>
#>  1 Sorghum      United Republic of Tanzania  1961    200000  180000
#>  2 Wheat        Zambia                       1964       145     214
#>  3 Wheat        United Republic of Tanzania  1967     31000   35099
#>  4 Wheat        Zambia                       1969       100     220
#>  5 Wheat        Zambia                       1970       100     101
#>  6 Wheat        United Republic of Tanzania  1974     46000   82000
#>  7 Maize (corn) Zambia                       1975   1020000 1483133
#>  8 Maize (corn) Zambia                       1981    493783 1007280
#>  9 Wheat        United Republic of Tanzania  1981     51000   95000
#> 10 Sorghum      Zambia                       1981     21500   13950
#> 11 Sorghum      Zambia                       1985     24811   20227
#> 12 Maize (corn) United Republic of Tanzania  1997   1564000 1831200
#> 13 Sorghum      United Republic of Tanzania  1997    622400  538200
#> 14 Sorghum      United Republic of Tanzania  1999    659868  561031
#> 15 Maize (corn) Zambia                       2001    582000  802000
#> 16 Wheat        Zambia                       2002     12000   75000
#> 17 Wheat        United Republic of Tanzania  2003     26890   74000
#> 18 Wheat        Zambia                       2006     17100   93958
#> 19 Wheat        United Republic of Tanzania  2011    108287  112658
#> 20 Sorghum      United Republic of Tanzania  2019    646868  731877
# 
# #2
crops_df %>% 
  sample_n(., size = 20) %>% 
  arrange(country, year)
#> # A tibble: 20 × 5
#>    crop         country                      year harv_area     prod
#>    <chr>        <chr>                       <dbl>     <dbl>    <dbl>
#>  1 Maize (corn) United Republic of Tanzania  1980   1400000 1726000 
#>  2 Maize (corn) United Republic of Tanzania  1985   1576280 2093000 
#>  3 Wheat        United Republic of Tanzania  1993     43965   59000 
#>  4 Wheat        United Republic of Tanzania  1994     48800   59000 
#>  5 Maize (corn) United Republic of Tanzania  2000   1017600 1965400 
#>  6 Maize (corn) United Republic of Tanzania  2002   1718200 4408420 
#>  7 Maize (corn) United Republic of Tanzania  2004   3173070 4651370 
#>  8 Sorghum      United Republic of Tanzania  2004    697220  648540 
#>  9 Wheat        Zambia                       1966        41      81 
#> 10 Maize (corn) Zambia                       1966    820000  770000 
#> 11 Sorghum      Zambia                       1966     72000   47000 
#> 12 Wheat        Zambia                       1970       100     101 
#> 13 Wheat        Zambia                       1980      2400    9765 
#> 14 Wheat        Zambia                       1983      3044   10936 
#> 15 Maize (corn) Zambia                       1985    581846 1122351 
#> 16 Wheat        Zambia                       2002     12000   75000 
#> 17 Maize (corn) Zambia                       2007    585291 1366158 
#> 18 Sorghum      Zambia                       2016     26673   14106 
#> 19 Maize (corn) Zambia                       2017   1433944 3606549 
#> 20 Wheat        Zambia                       2022     33568  234925.
# 
# #3
crops_df %>% 
  sample_n(., size = 20) %>% 
  arrange(crop, country, year)
#> # A tibble: 20 × 5
#>    crop         country                      year harv_area     prod
#>    <chr>        <chr>                       <dbl>     <dbl>    <dbl>
#>  1 Maize (corn) United Republic of Tanzania  1974   1100000  761000 
#>  2 Maize (corn) United Republic of Tanzania  1982   1231550 1654000 
#>  3 Maize (corn) United Republic of Tanzania  1999    957550 2420940 
#>  4 Maize (corn) United Republic of Tanzania  2005   3109590 3131610 
#>  5 Maize (corn) Zambia                       1999    597454  822056 
#>  6 Maize (corn) Zambia                       2000    586907 1040000 
#>  7 Sorghum      United Republic of Tanzania  1962    175000  175000 
#>  8 Sorghum      United Republic of Tanzania  1992    683071  587128 
#>  9 Sorghum      Zambia                       1961     67000   40000 
#> 10 Sorghum      Zambia                       1974     60000   23140 
#> 11 Sorghum      Zambia                       1998     35864   25399 
#> 12 Sorghum      Zambia                       2023     13749    6836.
#> 13 Wheat        United Republic of Tanzania  1984     52010   83000 
#> 14 Wheat        United Republic of Tanzania  2002     30670   77000 
#> 15 Wheat        Zambia                       1966        41      81 
#> 16 Wheat        Zambia                       1971       100     100 
#> 17 Wheat        Zambia                       2010     27291  172256 
#> 18 Wheat        Zambia                       2011     37631  237332 
#> 19 Wheat        Zambia                       2012     37209  253522 
#> 20 Wheat        Zambia                       2018     21675  114463.
# 
# #4
crops_df %>% 
  sample_n(., size = 20) %>% 
  arrange(crop, country, -year)
#> # A tibble: 20 × 5
#>    crop         country                      year harv_area     prod
#>    <chr>        <chr>                       <dbl>     <dbl>    <dbl>
#>  1 Maize (corn) United Republic of Tanzania  2005   3109590 3131610 
#>  2 Maize (corn) United Republic of Tanzania  1993   1824000 2282200 
#>  3 Maize (corn) United Republic of Tanzania  1989   1980000 3128000 
#>  4 Maize (corn) United Republic of Tanzania  1973   1000000  887000 
#>  5 Maize (corn) Zambia                       2006    618959 1424400 
#>  6 Maize (corn) Zambia                       2004    631000 1214000 
#>  7 Maize (corn) Zambia                       1998    510372  638134 
#>  8 Maize (corn) Zambia                       1980    555410  937266 
#>  9 Sorghum      United Republic of Tanzania  2008    566760  551270 
#> 10 Sorghum      United Republic of Tanzania  1981    700000  425000 
#> 11 Sorghum      United Republic of Tanzania  1973    353749  192406 
#> 12 Sorghum      Zambia                       2020     35222   20011.
#> 13 Sorghum      Zambia                       1989     52008   33757 
#> 14 Sorghum      Zambia                       1985     24811   20227 
#> 15 Wheat        United Republic of Tanzania  2008     43160   43360 
#> 16 Wheat        United Republic of Tanzania  2000     71700   32700 
#> 17 Wheat        United Republic of Tanzania  1991     50300   84000 
#> 18 Wheat        United Republic of Tanzania  1983     47620   74000 
#> 19 Wheat        Zambia                       1999      9921   69226 
#> 20 Wheat        Zambia                       1985      5000   17156

We take a random sample of 20 rows in each example above, ordering the resulting output differently in each case: #1 orders ascending by year (oldest to most recent); #2 orders ascending first by country, then by year; #3 ascending by crop, country, and year; #4 ascending by crop and country, but descending by year (note the - sign).

We can also sort the full crop dataset:

# Chunk 18
# #1
crops_df %>% arrange(desc(year), country, desc(crop))
#> # A tibble: 384 × 5
#>    crop         country                      year harv_area      prod
#>    <chr>        <chr>                       <dbl>     <dbl>     <dbl>
#>  1 Wheat        United Republic of Tanzania  2024     86500   119000 
#>  2 Sorghum      United Republic of Tanzania  2024    750000   800000 
#>  3 Maize (corn) United Republic of Tanzania  2024   5741681 10084000 
#>  4 Wheat        Zambia                       2024     28052   198886.
#>  5 Sorghum      Zambia                       2024      4524     2865.
#>  6 Maize (corn) Zambia                       2024    684402  1511143.
#>  7 Wheat        United Republic of Tanzania  2023     60000    86522 
#>  8 Sorghum      United Republic of Tanzania  2023    683967   737819 
#>  9 Maize (corn) United Republic of Tanzania  2023   4200000  8010949 
#> 10 Wheat        Zambia                       2023     36551   277492.
#> # ℹ 374 more rows
#
# #2
crops_df %>% arrange(-year, country, -crop)
#> Error in `arrange()`:
#> ℹ In argument: `..3 = -crop`.
#> Caused by error in `-crop`:
#> ! invalid argument to unary operator
# crops_df %>% arrange(-year, country, desc(crop))

In #1 we sort first by year in descending order (desc() does the same as -, arranging in descending order), then by country alphabetically, and then by crop in reverse alphabetical order. Note that in #2 we can’t use - to do the reverse sort on crop, while the first - on year is valid: year is numeric, but crop is character, and the unary - only works on numeric variables. The commented-out variant would work if run.

2.2.8 Mutating

The last thing we will look at in this section is mutation, which uses the mutate function. That is, we use it to change a tibble by either altering existing variables or adding new ones. This brings us back to our crops_df. In Chunk 16 we created a new yields tibble so that we could demonstrate a merge with our crops_df. That was an entirely unnecessary operation, because if I wanted to add a yield column to crops_df, I could have just done this:

# Chunk 19
crop_ylds <- crops_df %>% mutate(yield = prod / harv_area)
crop_ylds
#> # A tibble: 384 × 6
#>    crop         country                      year harv_area   prod yield
#>    <chr>        <chr>                       <dbl>     <dbl>  <dbl> <dbl>
#>  1 Maize (corn) United Republic of Tanzania  1961    790000 590000 0.747
#>  2 Maize (corn) United Republic of Tanzania  1962    800000 600000 0.75 
#>  3 Maize (corn) United Republic of Tanzania  1963    960000 850000 0.885
#>  4 Maize (corn) United Republic of Tanzania  1964    930000 720000 0.774
#>  5 Maize (corn) United Republic of Tanzania  1965    950000 751000 0.791
#>  6 Maize (corn) United Republic of Tanzania  1966   1100000 880000 0.8  
#>  7 Maize (corn) United Republic of Tanzania  1967   1000000 750000 0.75 
#>  8 Maize (corn) United Republic of Tanzania  1968   1014000 551000 0.543
#>  9 Maize (corn) United Republic of Tanzania  1969   1014000 638000 0.629
#> 10 Maize (corn) United Republic of Tanzania  1970   1015000 488000 0.481
#> # ℹ 374 more rows

That avoids the joining process entirely, doing in one step what we did in two steps in Chunk 16, courtesy of the mutate function.

We can also use mutate to modify existing values. Let’s do that. I don’t like having the full country names in the tibble, particularly with spaces in them, so let’s shorten those.

# Chunk 20
set.seed(1)
crop_ylds %>% 
  mutate(country = ifelse(country == "United Republic of Tanzania", "TZA", country), 
         country = ifelse(country == "Zambia", "ZMB", country)) %>% 
  sample_n(5)
#> # A tibble: 5 × 6
#>   crop    country  year harv_area   prod yield
#>   <chr>   <chr>   <dbl>     <dbl>  <dbl> <dbl>
#> 1 Wheat   ZMB      1964       145    214 1.48 
#> 2 Sorghum TZA      1999    659868 561031 0.850
#> 3 Sorghum TZA      1961    200000 180000 0.9  
#> 4 Wheat   TZA      2003     26890  74000 2.75 
#> 5 Wheat   TZA      1974     46000  82000 1.78
set.seed(1)
crop_ylds %>% 
  mutate(country = ifelse(country == "United Republic of Tanzania", "TZA", country)) %>%  
  mutate(country = ifelse(country == "Zambia", "ZMB", country)) %>% 
  sample_n(5)
#> # A tibble: 5 × 6
#>   crop    country  year harv_area   prod yield
#>   <chr>   <chr>   <dbl>     <dbl>  <dbl> <dbl>
#> 1 Wheat   ZMB      1964       145    214 1.48 
#> 2 Sorghum TZA      1999    659868 561031 0.850
#> 3 Sorghum TZA      1961    200000 180000 0.9  
#> 4 Wheat   TZA      2003     26890  74000 2.75 
#> 5 Wheat   TZA      1974     46000  82000 1.78

In the first pipeline we use the mutate function with ifelse to change all values matching “United Republic of Tanzania” to “TZA” and all values matching “Zambia” to “ZMB”, leaving any other values unchanged. Here we wrap both changes in a single mutate call, separated by a comma. We could also use a separate mutate call for each change, as we do in the second pipeline. In both cases we sample 5 rows to show that the changes were made for both countries.
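As an aside, once you have more than a couple of recodings, nested or repeated ifelse calls get unwieldy; dplyr’s case_when is a cleaner alternative. A sketch on a small stand-in tibble (tb here is hypothetical, not one of our datasets):

```r
library(dplyr)

tb <- tibble(country = c("United Republic of Tanzania", "Zambia", "Malawi"))

# each two-sided formula is condition ~ replacement, evaluated in order
recoded <- tb %>% 
  mutate(country = case_when(
    country == "United Republic of Tanzania" ~ "TZA",
    country == "Zambia" ~ "ZMB",
    TRUE ~ country  # leave anything unmatched unchanged
  ))
recoded$country  # "TZA" "ZMB" "Malawi"
```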

We’ll wrap it all up now:

# Chunk 21
crop_ylds <- crop_ylds %>% 
  mutate(country = ifelse(country == "United Republic of Tanzania", "TZA", country)) %>%  
  mutate(country = ifelse(country == "Zambia", "ZMB", country)) %>% 
  mutate(crop = tolower(crop)) %>%
  mutate(crop = ifelse(crop == "maize (corn)", "maize", crop))
set.seed(1)
crop_ylds %>% sample_n(5)
#> # A tibble: 5 × 6
#>   crop    country  year harv_area   prod yield
#>   <chr>   <chr>   <dbl>     <dbl>  <dbl> <dbl>
#> 1 wheat   ZMB      1964       145    214 1.48 
#> 2 sorghum TZA      1999    659868 561031 0.850
#> 3 sorghum TZA      1961    200000 180000 0.9  
#> 4 wheat   TZA      2003     26890  74000 2.75 
#> 5 wheat   TZA      1974     46000  82000 1.78

We change all the values in country, as before, then convert the crop names to lowercase, and clean up the maize name, all in a single chained statement.

There are many other ways to mutate data, using more advanced pattern matching capabilities and other features. We will start to see more of those in subsequent sections.

2.3 Practice

2.3.1 Questions

  1. What is the difference between a tibble and a data.frame?

  2. Given a tibble tb_a, with columns a and b, how would extract column a using base R methods? Using tidyverse methods?

  3. You are given a csv that is organized into 6 rows and 14 columns. The first two columns contain country names (two countries) and district names (three districts per country). The other 12 columns are named by the month of the year, and each month contains the average number of measles cases reported over a 5 year period in that month for each district in each country. Are these data tidy or messy? If messy, how would you rearrange them?

  4. Describe why each of the four outputs from Chunk 15a differs as a function of the *_join function used.

2.3.2 Code

  1. Re-create the dummy tibble in Chunk 3, and write it out to a csv using readr::write_csv, to any directory convenient for you (e.g. into your notebooks/ folder), calling it “my_dummy_file.csv”.

  2. Use dir to list files in the directory you write it into matching the word “dummy”, and then read it back in using readr::read_csv.

  3. Building on the example in Chunk 15a, join all 4 tibbles together, but replace full_join with left_join. Do the same again using right_join.

  4. Rearrange the results of the two joins you just did, using v5 as the sort. Do an ascending sort on the results of the chained left_join, and descending on the right_join.

  5. Recreate the crop_ylds tibble. Skip the part where we created yield via a merge, and instead use the approach where we just did a mutate. Don’t forget to change the crop capitalization and change the country names also. After you reproduce what it looks like in Chunk 21, use mutate to add a new column harv_km2 by doing harv_area / 100.

  6. Now rename the new harv_km2 variable to harv_area_km2, using the rename function.

  7. Do all of this in a single pipeline: 1) create a tibble with variables v1 (values 1:10) and v2 (values 11:20); 2) rbind that with another tibble with the same variables but with values 11:20 and 20:30 (hint: you will need to wrap the second tibble in an rbind, e.g. rbind(., tibble())); 3) add a new column v3 to the result of that, which is the square of v2; 4) arrange it in descending order of v3; 5) capture the result of all that into my_tb

  8. End by selecting rows 1, 10, and 17 and v2 and v3 from my_tb, using slice and select

3 Analyzing your data

Now that we have had an introduction to data preparation, we will learn how to analyze our data. We have already seen some very basic analyses, in the form of basic algebra and data summaries. Now we will move on to more advanced data manipulation and analyses.

3.1 Split-apply-combine

Many data analysis problems involve the application of a split-apply-combine strategy, where you break up a big problem into manageable pieces, operate on each piece independently and then put all the pieces back together.

So says Hadley Wickham in a paper that describes this common data analysis strategy and the plyr package, which is the forerunner of dplyr.

Long or tall (i.e. tidy) data formats facilitate split-apply-combine (SAC) analyses. Let’s come back to our crop_ylds dataset to look at this:

# Chunk 22
crop_ylds %>% 
  group_by(crop) %>% 
  summarize(mean_yield = mean(yield))
#> # A tibble: 3 × 2
#>   crop    mean_yield
#>   <chr>        <dbl>
#> 1 maize        1.54 
#> 2 sorghum      0.780
#> 3 wheat        3.04

We just did an SAC on the crop_ylds data that entailed splitting the dataset by crop type, applying the mean function, and then combining the results of that into a summary tibble. We can use an lapply and rbind to illustrate the inner workings of that SAC:

# Chunk 23
# #1
splt_apply <- lapply(unique(crop_ylds$crop), function(x) {
  dat <- crop_ylds[crop_ylds$crop == x, ]  # here's the split
  cat("\n")  # add empty line
  print(paste("Split out", x))  # print statement to mark split  
  print(dat[1, ])  # print showing first line of split tibble
  o <- data.frame(crop = x, mean_yield = mean(dat$yield))  # here the apply
})
#> 
#> [1] "Split out maize"
#> # A tibble: 1 × 6
#>   crop  country  year harv_area   prod yield
#>   <chr> <chr>   <dbl>     <dbl>  <dbl> <dbl>
#> 1 maize TZA      1961    790000 590000 0.747
#> 
#> [1] "Split out sorghum"
#> # A tibble: 1 × 6
#>   crop    country  year harv_area   prod yield
#>   <chr>   <chr>   <dbl>     <dbl>  <dbl> <dbl>
#> 1 sorghum TZA      1961    200000 180000   0.9
#> 
#> [1] "Split out wheat"
#> # A tibble: 1 × 6
#>   crop  country  year harv_area  prod yield
#>   <chr> <chr>   <dbl>     <dbl> <dbl> <dbl>
#> 1 wheat TZA      1961      8000  6100 0.762
#
# #2
do.call(rbind, splt_apply)  # here's the combine
#>      crop mean_yield
#> 1   maize   1.538175
#> 2 sorghum   0.779559
#> 3   wheat   3.035511

Chunk 23 #1 contains both the split and the apply parts (see the comments in the code), using base R syntax rather than dplyr to avoid mixing syntaxes, with some extra print statements added so that we can see how the splits are happening within the lapply (by showing the first row of each split-out tibble). Note how unique is used to get a vector of unique crop names, which is passed to an anonymous function that iterates over each crop name, using logical indexing to subset (split) out the rows of crop_ylds corresponding to that crop. #2 does the combine, recreating the result produced with dplyr in Chunk 22. Given that, you will note that dplyr::group_by does the splitting, dplyr::summarize does the applying (of mean), and the combine is implicit.
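For comparison, base R also bundles split-apply-combine into single functions: tapply and aggregate each condense the whole of Chunk 23 into one line. A sketch on a toy stand-in for crop_ylds:

```r
# toy stand-in for crop_ylds
df <- data.frame(crop = rep(c("maize", "wheat"), each = 3),
                 yield = c(1, 2, 3, 4, 5, 6))

# tapply: split yield by crop, apply mean, combine into a named vector
means <- tapply(df$yield, df$crop, mean)
means  # maize = 2, wheat = 5

# aggregate does the same but returns a data.frame
aggregate(yield ~ crop, data = df, FUN = mean)
```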

Now let’s look at a more complex SAC:

# Chunk 24
crop_ylds %>% 
  group_by(crop, country) %>% 
  summarize(mean_yield = mean(yield))
#> `summarise()` has grouped output by 'crop'. You can override using the `.groups` argument.
#> # A tibble: 6 × 3
#> # Groups:   crop [3]
#>   crop    country mean_yield
#>   <chr>   <chr>        <dbl>
#> 1 maize   TZA          1.32 
#> 2 maize   ZMB          1.75 
#> 3 sorghum TZA          0.904
#> 4 sorghum ZMB          0.655
#> 5 wheat   TZA          1.38 
#> 6 wheat   ZMB          4.69

Here we do two splits: the first on crop, the second on country, so after applying and combining we get the mean yield for each crop for each country. Let’s do that with lapply:

# Chunk 25a
# #1
splt_apply <- lapply(unique(crop_ylds$crop), function(x) {
  cat("\n")  # add empty line
  print(paste("Split out", x))  # print to mark outer split
  cntry_splt_apply <- lapply(unique(crop_ylds$country), function(y) {
    dat <- crop_ylds[crop_ylds$crop == x & crop_ylds$country == y, ]  # split
    cat("\n")  # add empty line
    print(paste("...then split out", y, x))  # print to mark inner split
    print(dat[1, ])  # print to show 1st line of split tibble
    o <- data.frame(crop = x, country = y, mean_yield = mean(dat$yield))
  })
  do.call(rbind, cntry_splt_apply)
})
#> 
#> [1] "Split out maize"
#> 
#> [1] "...then split out TZA maize"
#> # A tibble: 1 × 6
#>   crop  country  year harv_area   prod yield
#>   <chr> <chr>   <dbl>     <dbl>  <dbl> <dbl>
#> 1 maize TZA      1961    790000 590000 0.747
#> 
#> [1] "...then split out ZMB maize"
#> # A tibble: 1 × 6
#>   crop  country  year harv_area   prod yield
#>   <chr> <chr>   <dbl>     <dbl>  <dbl> <dbl>
#> 1 maize ZMB      1961    750000 660000  0.88
#> 
#> [1] "Split out sorghum"
#> 
#> [1] "...then split out TZA sorghum"
#> # A tibble: 1 × 6
#>   crop    country  year harv_area   prod yield
#>   <chr>   <chr>   <dbl>     <dbl>  <dbl> <dbl>
#> 1 sorghum TZA      1961    200000 180000   0.9
#> 
#> [1] "...then split out ZMB sorghum"
#> # A tibble: 1 × 6
#>   crop    country  year harv_area  prod yield
#>   <chr>   <chr>   <dbl>     <dbl> <dbl> <dbl>
#> 1 sorghum ZMB      1961     67000 40000 0.597
#> 
#> [1] "Split out wheat"
#> 
#> [1] "...then split out TZA wheat"
#> # A tibble: 1 × 6
#>   crop  country  year harv_area  prod yield
#>   <chr> <chr>   <dbl>     <dbl> <dbl> <dbl>
#> 1 wheat TZA      1961      8000  6100 0.762
#> 
#> [1] "...then split out ZMB wheat"
#> # A tibble: 1 × 6
#>   crop  country  year harv_area  prod yield
#>   <chr> <chr>   <dbl>     <dbl> <dbl> <dbl>
#> 1 wheat ZMB      1961       380   608   1.6
#
# #2
do.call(rbind, splt_apply)
#>      crop country mean_yield
#> 1   maize     TZA  1.3240753
#> 2   maize     ZMB  1.7522742
#> 3 sorghum     TZA  0.9038347
#> 4 sorghum     ZMB  0.6552833
#> 5   wheat     TZA  1.3769024
#> 6   wheat     ZMB  4.6941196

Making this SAC with lapply requires an lapply nested in an lapply. The outer lapply splits on crop, and the inner on each country within each crop (using unique(crop_ylds$country) to get the vector of country names). mean is applied to the split data within the inner lapply.

Right, so that might look daunting, but not to worry: we are just using those lapplys to illustrate the splitting, and to show how dplyr pipelines can make things much simpler. So we’ll forge ahead with those.

What happens if we want to do an SAC on an even finer slice of the data within each split, say, calculating the mean yield since the year 2000 by crop and by country:

# Chunk 25b
crop_ylds %>% filter(year >= 2000) %>% group_by(crop, country) %>% 
  summarize(mean_yield = mean(yield))
#> `summarise()` has grouped output by 'crop'. You can override using the `.groups` argument.
#> # A tibble: 6 × 3
#> # Groups:   crop [3]
#>   crop    country mean_yield
#>   <chr>   <chr>        <dbl>
#> 1 maize   TZA          1.60 
#> 2 maize   ZMB          2.29 
#> 3 sorghum TZA          1.01 
#> 4 sorghum ZMB          0.707
#> 5 wheat   TZA          1.39 
#> 6 wheat   ZMB          6.67

We can add dplyr::filter with the conditional statement (year >= 2000) to select observations since the year 2000, and do the SAC on those. From ?dplyr::filter’s description:

Use filter() to choose rows/cases where conditions are true. Unlike base subsetting with [, rows where the condition evaluates to NA are dropped.
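That NA-dropping behavior is worth seeing directly; a small sketch contrasting filter with base [ on a stand-in data.frame:

```r
library(dplyr)

df <- data.frame(x = c(1, NA, 3))

# filter drops the row where x > 1 evaluates to NA
filtered <- df %>% filter(x > 1)
filtered  # 1 row: x = 3

# base [ keeps it, as an all-NA row
df[df$x > 1, , drop = FALSE]  # 2 rows: an all-NA row, then x = 3
```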

How about pre-2000 mean yields?

# Chunk 26
crop_ylds %>% filter(year < 2000) %>% group_by(crop, country) %>% 
  summarize(mean_yield = mean(yield))
#> `summarise()` has grouped output by 'crop'. You can override using the `.groups` argument.
#> # A tibble: 6 × 3
#> # Groups:   crop [3]
#>   crop    country mean_yield
#>   <chr>   <chr>        <dbl>
#> 1 maize   TZA          1.15 
#> 2 maize   ZMB          1.41 
#> 3 sorghum TZA          0.839
#> 4 sorghum ZMB          0.622
#> 5 wheat   TZA          1.37 
#> 6 wheat   ZMB          3.43

Same procedure, just a different conditional operator inside filter. But maybe we want to compare pre- and post-2000 yields in the same dataset:

# Chunk 27
crop_ylds %>% 
  mutate(y2k = ifelse(year < 2000, "1961_2000", "2000_2024")) %>% 
  group_by(crop, country, y2k) %>% 
  summarize(mean_yield = mean(yield))
#> `summarise()` has grouped output by 'crop', 'country'. You can override using the `.groups`
#> argument.
#> # A tibble: 12 × 4
#> # Groups:   crop, country [6]
#>    crop    country y2k       mean_yield
#>    <chr>   <chr>   <chr>          <dbl>
#>  1 maize   TZA     1961_2000      1.15 
#>  2 maize   TZA     2000_2024      1.60 
#>  3 maize   ZMB     1961_2000      1.41 
#>  4 maize   ZMB     2000_2024      2.29 
#>  5 sorghum TZA     1961_2000      0.839
#>  6 sorghum TZA     2000_2024      1.01 
#>  7 sorghum ZMB     1961_2000      0.622
#>  8 sorghum ZMB     2000_2024      0.707
#>  9 wheat   TZA     1961_2000      1.37 
#> 10 wheat   TZA     2000_2024      1.39 
#> 11 wheat   ZMB     1961_2000      3.43 
#> 12 wheat   ZMB     2000_2024      6.67

That’s a little more complex, and doesn’t use filter. Instead, we first use mutate with ifelse to create a new variable (y2k), which indicates whether years are before 2000, or 2000 and onward. We then use y2k to apply a third split to the data after our crop and country splits, and finally end with the mean yields from our three-way splits.

That comparison might be easier if we put the pre- and post-2000 yields next to one another. That would require just one more step to do:

# Chunk 28
crop_ylds %>% 
  mutate(y2k = ifelse(year < 2000, "1961_2000", "2000_2024")) %>% 
  group_by(crop, country, y2k) %>% 
  summarize(mean_yield = mean(yield)) %>% 
  pivot_wider(names_from = y2k, values_from = mean_yield)
#> `summarise()` has grouped output by 'crop', 'country'. You can override using the `.groups`
#> argument.
#> # A tibble: 6 × 4
#> # Groups:   crop, country [6]
#>   crop    country `1961_2000` `2000_2024`
#>   <chr>   <chr>         <dbl>       <dbl>
#> 1 maize   TZA           1.15        1.60 
#> 2 maize   ZMB           1.41        2.29 
#> 3 sorghum TZA           0.839       1.01 
#> 4 sorghum ZMB           0.622       0.707
#> 5 wheat   TZA           1.37        1.39 
#> 6 wheat   ZMB           3.43        6.67

In this case we just use pivot_wider with y2k as the key, and the mean yields as the values. Much easier to compare, in my opinion, and it shows how much larger post-2000 yields are. Wait, we could calculate exactly how much larger:

# Chunk 29
crop_ylds %>% 
  mutate(y2k = ifelse(year < 2000, "1961_2000", "2000_2024")) %>% 
  group_by(crop, country, y2k) %>% 
  summarize(mean_yield = mean(yield)) %>% 
  pivot_wider(names_from = y2k, values_from = mean_yield) %>%
  mutate(yield_ratio = `2000_2024` / `1961_2000`)
#> `summarise()` has grouped output by 'crop', 'country'. You can override using the `.groups`
#> argument.
#> # A tibble: 6 × 5
#> # Groups:   crop, country [6]
#>   crop    country `1961_2000` `2000_2024` yield_ratio
#>   <chr>   <chr>         <dbl>       <dbl>       <dbl>
#> 1 maize   TZA           1.15        1.60         1.39
#> 2 maize   ZMB           1.41        2.29         1.63
#> 3 sorghum TZA           0.839       1.01         1.20
#> 4 sorghum ZMB           0.622       0.707        1.14
#> 5 wheat   TZA           1.37        1.39         1.01
#> 6 wheat   ZMB           3.43        6.67         1.95

We use a second mutate to create a new variable that is the ratio of post-2000 to pre-2000 mean yields. Maybe we are interested in a straight comparison of pre- and post-2000 maize yields across all countries:

# Chunk 30
crop_ylds %>% filter(crop == "maize") %>% 
  mutate(y2k = ifelse(year < 2000, "1961_2000", "2000_2024")) %>%
  group_by(y2k) %>% summarize(mean_yield = mean(yield)) %>% 
  pivot_wider(names_from = y2k, values_from = mean_yield) %>%
  mutate(yield_ratio = `2000_2024` / `1961_2000`)
#> # A tibble: 1 × 3
#>   `1961_2000` `2000_2024` yield_ratio
#>         <dbl>       <dbl>       <dbl>
#> 1        1.28        1.94        1.52

We leave it to you in the practice questions below to answer how Chunk 30 differs from Chunk 29.
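As an aside, the long-to-wide step that pivot_wider performed in Chunks 28-30 has a base-R analogue, stats::reshape, sketched here on a toy two-row table (the data values are made up for illustration):

```r
# One row per y2k period becomes one column per period
longd <- data.frame(crop = c("maize", "maize"),
                    y2k = c("1961_2000", "2000_2024"),
                    mean_yield = c(1.15, 1.60))
wided <- reshape(longd, idvar = "crop", timevar = "y2k", direction = "wide")
wided  # columns: crop, mean_yield.1961_2000, mean_yield.2000_2024
```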

3.2 Data summaries

Now that we have a sense of how to do split-apply-combine, let's dig a bit more into the analyses that we can do as part of that process. Since R was originally designed for statistics, there are many possible analyses we can do with R. We'll look at just a few common ones, starting with the calculation of summary statistics outside of an SAC. The easiest way to do that in R is to use the base summary function.

# Chunk 31
# #1
summary(crop_ylds$harv_area)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>      41   33560  155600  604725  802500 5741681
#
# #2
crop_ylds %>% 
  select(harv_area) %>% 
  summary()
#>    harv_area      
#>  Min.   :     41  
#>  1st Qu.:  33560  
#>  Median : 155600  
#>  Mean   : 604725  
#>  3rd Qu.: 802500  
#>  Max.   :5741681
#
# #3
crop_ylds %>% 
  pull(harv_area) %>% 
  summary()
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>      41   33560  155600  604725  802500 5741681

In #1, we use the summary function on harv_area, pulled out of crop_ylds as a vector. #2 does the same thing using dplyr verbs. Note how summary() appears at the end of the pipeline, and we don't have to give it a variable name because it implicitly consumes everything coming out of the upstream pipe (in this case, just harv_area). However, that returns summary output in a table, whereas the base method returns a named vector. #3 produces the equivalent of #1, using dplyr::pull instead of select. pull is the dplyr equivalent of crop_ylds$harv_area, i.e. it pulls a variable from a data.frame or tibble out into a vector.
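The distinction between select and pull mirrors base R's single- versus double-bracket subsetting, which a quick check of classes makes concrete:

```r
# select() keeps the table wrapper; pull() (like $ or [[) extracts the vector
df <- data.frame(harv_area = c(41, 33560, 155600))
class(df["harv_area"])    # "data.frame" -- the shape select() preserves
class(df[["harv_area"]])  # "numeric"    -- the bare vector, like pull()
```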

summary is giving us a set of quantile values and the mean of the vector. We can also use it on more than one variable:

# Chunk 32
# #1
summary(crop_ylds[, c("harv_area", "prod")])
#>    harv_area            prod         
#>  Min.   :     41   Min.   :      81  
#>  1st Qu.:  33560   1st Qu.:   44999  
#>  Median : 155600   Median :  185810  
#>  Mean   : 604725   Mean   :  862514  
#>  3rd Qu.: 802500   3rd Qu.:  880799  
#>  Max.   :5741681   Max.   :10084000
#
# #2
crop_ylds %>% 
  select(harv_area, prod) %>% 
  summary()
#>    harv_area            prod         
#>  Min.   :     41   Min.   :      81  
#>  1st Qu.:  33560   1st Qu.:   44999  
#>  Median : 155600   Median :  185810  
#>  Mean   : 604725   Mean   :  862514  
#>  3rd Qu.: 802500   3rd Qu.:  880799  
#>  Max.   :5741681   Max.   :10084000

#1 shows the base approach, and #2 the dplyr method. Fairly straightforward. But what if we want to do this as part of an SAC (which the approaches above are not), where we need to apply some grouping?

# Chunk 33
q1 <- function(x) quantile(x, 0.25) 
q3 <- function(x) quantile(x, 0.75)
crop_ylds %>% select(crop, country, harv_area, prod) %>% 
  group_by(crop, country) %>% 
  rename(ha = harv_area, p = prod) %>% 
  summarise_all(list(min, q1, med = median, mu = mean, q3, max))
#> # A tibble: 6 × 14
#> # Groups:   crop [3]
#>   crop   country ha_fn1  p_fn1 ha_fn2  p_fn2 ha_med  p_med  ha_mu   p_mu ha_fn3  p_fn3 ha_fn4  p_fn4
#>   <chr>  <chr>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
#> 1 maize  TZA     790000 4.88e5 1.10e6 1.46e6 1.61e6 2.34e6 2.14e6 3.01e6 3.20e6 4.47e6 5.74e6 1.01e7
#> 2 maize  ZMB     362787 4.83e5 5.95e5 8.17e5 7.67e5 1.13e6 8.09e5 1.46e6 1.00e6 1.86e6 1.43e6 3.62e6
#> 3 sorgh… TZA     162000 1.35e5 4.29e5 2.88e5 6.51e5 5.84e5 5.72e5 5.31e5 7.13e5 7.30e5 8.74e5 9.71e5
#> 4 sorgh… ZMB       4524 2.86e3 2.18e4 1.50e4 3.80e4 2.25e4 4.20e4 2.65e4 6.5 e4 3.58e4 1   e5 6.11e4
#> 5 wheat  TZA       8000 6.1 e3 4.15e4 5.88e4 5.20e4 7.32e4 5.56e4 7.23e4 6   e4 8.54e4 1.49e5 1.67e5
#> 6 wheat  ZMB         41 8.1 e1 1.47e3 4.98e3 1.05e4 5.63e4 1.26e4 7.85e4 2.18e4 1.35e5 4.18e4 2.77e5

The above is somewhat more complex. We want to calculate summary stats for prod and harv_area by crop and country, but dplyr can't easily reproduce summary's tabular output for each group. Therefore, we calculate each statistic separately, using summarise_all() (which applies the summarizing functions to each selected variable) and passing the functions in as a list. In the first two lines, we define two custom functions: the first calculates the first quartile (25th percentile), and the second calculates the third quartile (75th percentile). In the pipeline, we also do some renaming gymnastics, changing harv_area to ha and prod to p, and specifying in the list some shorter names for the output from median and mean. These renamings are in service of narrowing the width of the output columns.
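To see what those two helper functions do on their own, apply them to a simple vector:

```r
# quantile() returns a named value; unname() strips the "25%"/"75%" labels
q1 <- function(x) quantile(x, 0.25)
q3 <- function(x) quantile(x, 0.75)
v <- 1:100
c(q1 = unname(q1(v)), med = median(v), q3 = unname(q3(v)))
#>    q1   med    q3 
#> 25.75 50.50 75.25
```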

3.3 Associations

Evaluating associations between data variables (e.g. correlations) is one of the most frequent uses for R. Lets start with correlations:

# Chunk 34
# #1
cor(crop_ylds[, c("prod", "harv_area")])
#>                prod harv_area
#> prod      1.0000000 0.9513794
#> harv_area 0.9513794 1.0000000
# 
# #2
crop_ylds %>% 
  select(prod, harv_area) %>% 
  cor()
#>                prod harv_area
#> prod      1.0000000 0.9513794
#> harv_area 0.9513794 1.0000000
# 
# #3
cor.test(x = crop_ylds$prod, y = crop_ylds$harv_area)
#> 
#>  Pearson's product-moment correlation
#> 
#> data:  crop_ylds$prod and crop_ylds$harv_area
#> t = 60.368, df = 382, p-value < 2.2e-16
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  0.9408854 0.9600489
#> sample estimates:
#>       cor 
#> 0.9513794
#
# # 4
crop_ylds %>% 
  group_by(crop, country) %>% 
  summarise(cor = cor(prod, harv_area))
#> `summarise()` has grouped output by 'crop'. You can override using the `.groups` argument.
#> # A tibble: 6 × 3
#> # Groups:   crop [3]
#>   crop    country   cor
#>   <chr>   <chr>   <dbl>
#> 1 maize   TZA     0.942
#> 2 maize   ZMB     0.771
#> 3 sorghum TZA     0.858
#> 4 sorghum ZMB     0.942
#> 5 wheat   TZA     0.631
#> 6 wheat   ZMB     0.987

We first apply stats::cor using base syntax to calculate the correlation matrix between prod and harv_area (#1), and then do the same thing with dplyr (#2). In #3 we apply cor.test to those two variables, using base syntax, which produces statistics on the confidence interval and significance of the correlation coefficient. This is harder to do in dplyr, because the output from cor.test is not tabular, so we are not doing that here (but there is a tidyverse-oriented corrr package that should help with that). #4 shows how to use cor with summarize when grouping by crop and country. The output is the correlation coefficient between the two variables for each crop and country over the time series (this only works for a two-variable correlation analysis–it fails when a third variable is added). Note that all of the correlations are positive (as they should be–generally, the more area you plant to a crop, the more tons of it you produce), but that their strengths vary, with Tanzanian wheat the weakest. We'll come back to the maize relationships below.
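The grouped correlation in #4 can also be done with base R's split and sapply. Here is the same pattern on the built-in mtcars data, correlating mpg and wt within each cylinder group:

```r
# Split the two columns of interest by group, then correlate within each piece
grps <- split(mtcars[, c("mpg", "wt")], mtcars$cyl)
cors <- sapply(grps, function(d) cor(d$mpg, d$wt))
cors  # named vector, one correlation per cylinder group
```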

The next step after a correlation is a regression, which quantifies how much a variable Y changes in response to changes in a variable X (or several different X variables). Note: This is not a statistics course, so if you have never seen a regression before, then some of what comes next might be a little opaque. I recommend that you read up on the basics of least squares regression to help understand these examples better. Here is one of many online resources.

Here’s our first regression example:

# Chunk 35
# #1
prodha_lm <- lm(prod ~ harv_area, data = crop_ylds)
prodha_lm
#> 
#> Call:
#> lm(formula = prod ~ harv_area, data = crop_ylds)
#> 
#> Coefficients:
#> (Intercept)    harv_area  
#>  -52871.524        1.514
#
# #2
summary(prodha_lm)
#> 
#> Call:
#> lm(formula = prod ~ harv_area, data = crop_ylds)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -2574479  -222099    33457    84947  1860415 
#> 
#> Coefficients:
#>               Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) -5.287e+04  2.747e+04  -1.925    0.055 .  
#> harv_area    1.514e+00  2.508e-02  60.368   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 448800 on 382 degrees of freedom
#> Multiple R-squared:  0.9051, Adjusted R-squared:  0.9049 
#> F-statistic:  3644 on 1 and 382 DF,  p-value: < 2.2e-16

In #1 above, we use R's lm (linear model) function to fit a regression between prod (dependent variable) and harv_area (independent variable) across the entire crop_ylds dataset. Note the ~, which is used to separate the left-hand and right-hand sides of the regression formula. The output of lm is captured in prodha_lm, and printing it shows just the model formula and the regression coefficients. To see more of the model's results, you have to use summary on prodha_lm, as we do in #2, which spits out quite a bit, including the model fit (significance of coefficients, R2, etc.).
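Beyond printing summary(), you can extract individual pieces of a fit programmatically, which is handy when models feed into later calculations. A minimal sketch with simulated data:

```r
# Simulate data that follow y = 3 + 2x almost exactly, then recover the pieces
set.seed(1)
x <- 1:20
y <- 3 + 2 * x + rnorm(20, sd = 0.01)
fit <- lm(y ~ x)
coef(fit)                # named vector of intercept and slope
summary(fit)$r.squared   # R-squared, pulled from the summary object
confint(fit)             # confidence intervals on the coefficients
```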

Given our dataset, this relationship is not that interesting. At the very least, we would want to see the relationship by crop type, and also by crop and country. That requires some splitting:

# Chunk 36
# #1
maize_lm <- lm(prod ~ harv_area, 
               data = crop_ylds %>% filter(crop == "maize"))
#
# #2
maize_zm_lm <- lm(prod ~ harv_area, 
                  data = crop_ylds %>% 
                    filter(crop == "maize" & country == "ZMB"))
summary(maize_zm_lm)
#> 
#> Call:
#> lm(formula = prod ~ harv_area, data = crop_ylds %>% filter(crop == 
#>     "maize" & country == "ZMB"))
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -1127080  -539296   159063   448827  1015488 
#> 
#> Coefficients:
#>               Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) -6.497e+05  2.321e+05  -2.800  0.00681 ** 
#> harv_area    2.608e+00  2.732e-01   9.543 8.81e-14 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 561900 on 62 degrees of freedom
#> Multiple R-squared:  0.595,  Adjusted R-squared:  0.5884 
#> F-statistic: 91.07 on 1 and 62 DF,  p-value: 8.81e-14
#
# #3
maize_tza_lm <- lm(prod ~ harv_area, 
                  data = crop_ylds %>% 
                    filter(crop == "maize" & country == "TZA"))
summary(maize_tza_lm)
#> 
#> Call:
#> lm(formula = prod ~ harv_area, data = crop_ylds %>% filter(crop == 
#>     "maize" & country == "TZA"))
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -2608668  -357294  -104795   250739  2101905 
#> 
#> Coefficients:
#>               Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) -5.659e+05  1.861e+05  -3.041  0.00345 ** 
#> harv_area    1.672e+00  7.530e-02  22.200  < 2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 747600 on 62 degrees of freedom
#> Multiple R-squared:  0.8883, Adjusted R-squared:  0.8865 
#> F-statistic: 492.9 on 1 and 62 DF,  p-value: < 2.2e-16

The first split is simply on crop type (#1), so the regression is of maize production on harvested area across both countries and all years. Notice the use of dplyr pipes and verbs to split out the maize observations within the "data" argument of lm. The coefficients show us that production increases by roughly 1.52 tons per 1 ha increase in planted area. That's interesting, because it sounds pretty close to what you would get if you calculated average yield by dividing each production estimate by harvested area and then taking the average. We leave it to you to calculate mean yield in the practice below.

In #2, we split out Zambian maize, and in this case we see that Zambian maize production increases by about 2.61 tons per additional ha of harvested area, while #3 shows that Tanzanian maize production increases by a more modest 1.67 tons per ha. In the next section we will take a graphical look at how these two relationships differ.

We could also fit the lm and summary at the end of the dplyr pipeline:

# Chunk 37
crop_ylds %>% filter(crop == "maize" & country == "TZA") %>% 
  lm(prod ~ harv_area, data = .) %>% summary()
#> 
#> Call:
#> lm(formula = prod ~ harv_area, data = .)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -2608668  -357294  -104795   250739  2101905 
#> 
#> Coefficients:
#>               Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) -5.659e+05  1.861e+05  -3.041  0.00345 ** 
#> harv_area    1.672e+00  7.530e-02  22.200  < 2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 747600 on 62 degrees of freedom
#> Multiple R-squared:  0.8883, Adjusted R-squared:  0.8865 
#> F-statistic: 492.9 on 1 and 62 DF,  p-value: < 2.2e-16

That is actually more elegant than the code in Chunk 36. To make this work, we have to use the . in the "data" argument within lm, which represents the data.frame/tibble coming in from the upstream part of the pipeline. Note that we don't need the . in the summary function. Try running the code above, removing the . from within lm, to see what happens.

I’ll end this very brief introduction to analysis with a look at how we could do SAC with regression, so that we can run all country X crop prod ~ harv_area regressions. We’ll start with a melange of base R and dplyr:

# Chunk 38
crop_country_lms <- lapply(unique(crop_ylds$crop), function(x) {  
  lms <- lapply(unique(crop_ylds$country), function(y) {
    dat <- crop_ylds %>% filter(crop == x & country == y)
    summary(lm(prod ~ harv_area, dat))$coefficients
  })
  names(lms) <- unique(crop_ylds$country)
  return(lms)
})
names(crop_country_lms) <- unique(crop_ylds$crop)
crop_country_lms
#> $maize
#> $maize$TZA
#>                  Estimate   Std. Error   t value     Pr(>|t|)
#> (Intercept) -5.659083e+05 1.860748e+05 -3.041294 3.449820e-03
#> harv_area    1.671763e+00 7.530326e-02 22.200408 3.337146e-31
#> 
#> $maize$ZMB
#>                  Estimate   Std. Error   t value     Pr(>|t|)
#> (Intercept) -6.497481e+05 2.320597e+05 -2.799917 6.806310e-03
#> harv_area    2.607587e+00 2.732472e-01  9.542959 8.809769e-14
#> 
#> 
#> $sorghum
#> $sorghum$TZA
#>                  Estimate   Std. Error   t value     Pr(>|t|)
#> (Intercept) -79311.770843 4.896830e+04 -1.619656 1.103823e-01
#> harv_area        1.066315 8.097334e-02 13.168723 1.274210e-19
#> 
#> $sorghum$ZMB
#>                 Estimate   Std. Error   t value     Pr(>|t|)
#> (Intercept) 2231.4367094 1.237337e+03  1.803419 7.618158e-02
#> harv_area      0.5773724 2.608042e-02 22.138158 3.896236e-31
#> 
#> 
#> $wheat
#> $wheat$TZA
#>                 Estimate   Std. Error  t value     Pr(>|t|)
#> (Intercept) 3.420561e+04 6510.7806906 5.253688 1.940746e-06
#> harv_area   6.856259e-01    0.1070689 6.403593 2.289141e-08
#> 
#> $wheat$ZMB
#>                 Estimate   Std. Error   t value     Pr(>|t|)
#> (Intercept) -7323.512886 2428.4367561 -3.015731 3.713149e-03
#> harv_area       6.831819    0.1394885 48.977638 2.818943e-51

This arrangement is similar to what we were doing in Chunk 25b, using nested lapply to split on crop and country (using dplyr::filter instead of base R syntax to do the split), but applying lm inside the bodies of the lapplys. We extract only the coefficient table from the summary of each lm, to condense the output.
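The same pattern, stripped to its essentials with built-in data (a single-level split here, since mtcars has just one grouping variable of interest):

```r
# Fit mpg ~ wt separately within each cylinder class and keep only the
# coefficient tables, just as Chunk 38 does per crop and country
fits <- lapply(split(mtcars, mtcars$cyl), function(d) {
  summary(lm(mpg ~ wt, data = d))$coefficients
})
fits[["4"]]  # 2 x 4 coefficient table for the 4-cylinder cars
```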

There is a pure tidyverse way of doing this, which is the last thing to look at here. It is more involved than previous examples, because the non-tabular (list) output from lm prevents use of lm within dplyr::summarise:

# Chunk 39
# note: replacement for old variant suggested by answer here:
# https://stackoverflow.com/questions/62972843/retreiving-tidy-
# results-from-regression-by-group-with-broom
crop_ylds %>% 
  nest_by(crop, country) %>% 
  mutate(prod_ha_lm = list(lm(prod ~ harv_area, data = data))) %>%
  reframe(broom::tidy(prod_ha_lm))

This produces a nice output table that lists the coefficients (estimate), standard errors, t-statistic, and p-value for each crop-country model. Each model occupies two lines: one for the model intercept results, one for the independent variable results. There are a couple of extra functions here. nest_by(crop, country) collapses each crop-country group into a single row, with a list-column named data holding that group's observations. mutate then fits an lm to each group's data, wrapping the result in list() because a fitted model is not a simple value that fits in a regular column. Finally, reframe expands the per-group results back out into an ordinary tibble, with multiple rows per group.

And then there is broom::tidy, a function from broom, a non-core tidyverse package. Its purpose is to take an object (here, a fitted model) and convert it into a tidy data.frame, which is what turns the messy lm output into the nice table at the end of the pipeline above.

Okay, so that’s where we will leave it. Now for some practice.

3.4 Practice

3.4.1 Questions

  1. What function can you use to select a range of values from a tibble?

  2. Looking at chunk 27, and thinking about split-apply-combine operations, identify which line is responsible for splitting the data, what split(s) is (are) being made, which line is responsible for doing the applying, and what it is applying. Finally, which line is doing the combine?

  3. In Chunk 29, what did we do differently compared to Chunk 28?

  4. What is one limitation of dplyr::summarise with respect to analyzing groupings? What can we do to overcome that limitation?

3.4.2 Code

This coding practice assumes you have created the crop_ylds dataset to match its form as of Chunk 21.

  1. Use dplyr verbs to subset Tanzania sorghum yield from the year 2000 onwards. Do the same but use base R syntax.

  2. Calculate the mean and sd of prod, harv_area, and yield in the subset you just made, using dplyr::summarise_all with a list of summary functions. Chunk 33 is your guide.

  3. Now calculate mean and sd of harv_area and prod by crops and by country, using dplyr. Hint: you will need to use dplyr::group_by. Chunk 33 is also a guide.

  4. Use dplyr to calculate a correlation table between yield and harv_area for Zambian maize. Use cor.test to evaluate this relationship also. Hint: you will need to use the base R approach for this from Chunk 34 #3, but to preserve the subset of Zambia yields, first assign the subset to a new object dat.

  5. Use dplyr verbs to calculate the average maize yield (across, rather than within, countries). How close is that to the regression coefficient on harv_area from Chunk 36 #1?

  6. Use lm to fit a regression between Zambian maize yield and year, with year as the independent variable. Do the same for Tanzanian maize yields. Use dplyr verbs to create the subsets, but use both the base-centric (Chunk 36) and dplyr-centric (Chunk 37) ways of running the lms. Which country shows the largest annual changes in yields?

  7. Challenge: Adapt the code in Chunk 39 to create regressions between yield and year by country and crop, but just for maize and wheat.

4 Visualization

Until now we have been looking at numerical outputs from our R code. Graphical representation often conveys more information, and is typically done alongside numerical analysis. Our last section on regressions leads us naturally to the subject of visualization, as the most obvious way to assess a linear relationship between two variables is to examine their scatterplot. So let's look at that to start, but first a quick introduction to two different plotting approaches.

4.1 graphics graphics versus grid graphics

Back in Module 3 1.1 we mentioned that there is a split in R plotting methods centered on base R’s graphics and grid packages. ggplot2 is arguably the most prominent descendant of the latter branch of graphics.

We are going to learn examples of both approaches in this section. The reasons for this are that:

  1. Many spatial packages (including sf and the newer stars) have generic plot functions that have syntax modeled on graphics::plot;
  2. The same spatial packages are also moving toward compatibility with ggplot2, which is useful to know because that is rapidly becoming a standard.

I would add that it is helpful to know both approaches, because graphics based plots are often very good for rapid interactive data exploration, whereas ggplot can be helpful for creating more polished, presentation-grade graphics.

4.2 Two-dimensional plots

Back to our regression examples. We were exploring relationships between crop production and harvested area. We could get a better sense of that relationship by plotting those two variables against one another.

4.2.1 Scatterplots

We’ll start with graphics::plot, which ?plot tells us is a:

Generic function for plotting of R objects. For more details about the graphical parameter arguments, see par.

There are many, many possible parameters to tune to in making such a plot (which you can find in ?par), but we will just look a few. Here’s a start:

# Chunk 40
plot(prod ~ harv_area, data = crop_ylds[crop_ylds$crop == "maize", ])

This is a purely base way of plotting the relationship between harv_area on the X axis and prod on the Y axis for just maize (not distinguishing by country). We set up the relationship between the two variables as a formula, with prod on the left-hand side (LHS) and harv_area on the right-hand side (RHS) of the equation, and the data, including the subsetting, in the "data" argument slot.
We could do the same exact thing using the syntax below:

# Chunk 41
# #1
dat <- crop_ylds[crop_ylds$crop == "maize", ]
plot(x = dat$harv_area, y = dat$prod)

#
# #2
plot(x = crop_ylds$harv_area[crop_ylds$crop == "maize"], 
     y = crop_ylds$prod[crop_ylds$crop == "maize"])

Both are more cumbersome, but note that in both cases we pass our data as separate vectors into the “x” and “y” arguments. We don’t show the plots above, but they are each identical to the one in Chunk 40, except for the axis labels.

That’s not a particularly appealing plot, so let’s dress it up a bit:

# Chunk 42a
plot(prod ~ harv_area, data = crop_ylds[crop_ylds$crop == "maize", ], 
     pch = 16, cex = 0.7, col = "blue",
     main = "Maize production versus harvested area",
     xlab = "Harvested area (ha)",
     ylab = "Production (tons)")

This adds on to the approach in Chunk 40. We use the “pch” argument to change the type of symbol, “cex” changes the size of the symbol, “col” the color of the symbol, and “main”, “xlab”, and “ylab” add titles for the top of the plot and the X and Y axis, respectively.

Looking at the plot, we can see there is a clear positive relationship in the data up to about 3 million ha planted area, and then production seems to decline. What's going on there?

# Chunk 42b
# #1
zmb_maize <- crop_ylds %>% filter(crop == "maize" & country == "ZMB")
tza_maize <- crop_ylds %>% filter(crop == "maize" & country == "TZA")
# #2
plot(prod ~ harv_area, data = zmb_maize,      
     pch = 16, cex = 0.7, col = "blue", 
     main = "Zambia maize prod vs harv area",
     xlab = "Harvested area (ha)",
     ylab = "Production (tons)")

# #3
plot(prod ~ harv_area, data = tza_maize, 
     pch = 16, cex = 0.7, col = "red", 
     main = "Tanzania maize prod vs harv area",
     xlab = "Harvested area (ha)",
     ylab = "Production (tons)")

The difference is caused by different production/harvested area patterns in the two countries in the dataset. In the examples above, we separately plot each country's maize dataset, using dplyr::filter to create two new subsets of the data, which we feed separately to two new plot calls, changing the title and point color in each. This makes the reason for the original pattern (in Chunk 42a) clearer, but is still not ideal. Better to put both on one plot:

# Chunk 43
# #1
xl <- crop_ylds %>% filter(crop == "maize") %>% pull(harv_area) %>% range()
yl <- crop_ylds %>% filter(crop == "maize") %>% pull(prod) %>% range()
# #2
plot(prod ~ harv_area, data = zmb_maize, 
     pch = 16, cex = 0.7, col = "blue", xlim = xl, ylim = yl, 
     main = "Maize production vs harvested area",
     xlab = "Harvested area (ha)",
     ylab = "Production (tons)")
# #3
points(prod ~ harv_area, data = tza_maize, pch = 16, cex = 0.7, col = "red")
# #4
legend(x = "topleft", legend = c("Zambia", "Tanzania"), 
       col = c("blue", "red"), pch = 20)

That’s much clearer now, and the plot now makes obvious some additional information: Tanzania’s harvested area and total maize production are larger than Zambia’s. Despite these differences in scale, both countries show a clear positive relationship between the two variables, meaning that production increases as harvested area expands rather than declining. Additionally, the steeper slope of Zambia’s cluster suggests it achieves higher yields per unit of land, whereas Tanzania exhibits greater variability in production for similar amounts of harvested area.

Putting those variables on one plot took a few additional steps. First, we had to make both datasets fit on the same scale, which required fixing the data ranges on the X and Y axes of the plot. plot would otherwise automatically set the ranges to match the first input dataset (Zambian maize). So (in #1) we first figured out the X and Y ranges, using dplyr's filter and pull to extract the harv_area and prod vectors and calculating their ranges into new variables (xl and yl). We then (in #2) passed these to plot to set the "xlim" and "ylim" values (controlling the range of the X and Y axes, respectively). plot was applied to the zmb_maize data. We then used points (in #3) to add the tza_maize data to the existing plot. points does what its name says–it adds points to a plot. An existing plot must already be established for points to work, and its arguments are the same as those used in plot. Finally, in #4, we add a legend so that the reader can distinguish between the two sets of points. The "x" argument says where the legend should be placed in the graph, "legend" gives the legend text, "col" the point colors, and "pch" the point symbols.

That was a fair bit of work. Could we save some effort with dplyr piping?

# Chunk 44
crop_ylds %>% filter(crop == "maize" & country == "ZMB") %>% 
  plot(prod ~ harv_area, data = ., 
       pch = 16, cex = 0.7, col = "blue", xlim = xl, ylim = yl, 
       main = "Maize production vs harvested area",
       xlab = "Harvested area (ha)",
       ylab = "Production (tons)")
crop_ylds %>% filter(crop == "maize" & country == "TZA") %>% 
  points(prod ~ harv_area, data = ., pch = 16, cex = 0.7, col = "red")
legend(x = "topleft", legend = c("Zambia", "Tanzania"), 
       col = c("blue", "red"), pch = 20)

A little bit, it seems. We can subset the dataset using filter, and then pipe each subset to plot and points, respectively. To do this, we give . to the “data” argument in each function in order to receive the piped dataset.

That’s not in ideal pipeline example, however, because dplyr pipes are really more designed to work with the tidyverse’s plotting tools, i.e. ggplot2.

4.2.1.1 Introducing ggplot2

Now’s a good time to introduce ggplot2. Let’s recreate the first graph as a start:

# Chunk 45
library(ggplot2)
crop_ylds %>% filter(crop == "maize") %>% 
  ggplot() + geom_point(mapping = aes(x = harv_area, y = prod), col = "blue") +
  xlab("Harvested area (ha)") + ylab("Production (tons)") + 
  ggtitle("Maize production vs harvested area")

That’s a basic recreation of the same plot, although ggplot by default has a grey gridded background (which can be removed if needed). You can see that the syntax is very different than it is with plot. ggplot is based on a “grammar of graphics”, on which there a whole book, but also a paper by Hadley Wickham. There is a lot of information there, but some text from ggplot2 tidyverse page is instructive:

ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics. You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details…It’s hard to succinctly describe how ggplot2 works because it embodies a deep philosophy of visualisation. However, in most cases you start with ggplot(), supply a dataset and aesthetic mapping (with aes()). You then add on layers (like geom_point() or geom_histogram()), scales (like scale_colour_brewer()), faceting specifications (like facet_wrap()) and coordinate systems (like coord_flip()).

There are many functions in ggplot2 (the package) that control plotting functionality, and we won’t go through them all here (the cheatsheet is a nice resource, though). So we will go over just a few core parts, using the example in Chunk 45. We initiated the whole thing from a dplyr pipeline that we used to subset the maize part of crop_ylds, which we piped to ggplot, which (according to ?ggplot) “initializes a ggplot object”. It is empty in this case because it is receiving the piped input; otherwise we would need to give it a data.frame (let’s note here that ggplot only works with data.frames, unlike plot, which can also work with vectors).

We then pass that to geom_point, using +, which is somewhat analogous to dplyr’s %>%, but in this case is used to build up ggplot objects. geom_point is the geometry function for creating scatterplots. There are many different geom_* functions in ggplot2, one for each kind of plot. We’ll see more of those later.
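Because + builds up an object, a partially constructed ggplot can be stored in a variable and layers added later, which is handy when several plots share a base. A small sketch, using the built-in mtcars data rather than our crop data:

```r
library(ggplot2)
# A base plot holding the data and aesthetic mapping, but no geometry yet
base <- ggplot(mtcars, aes(x = wt, y = mpg))
base + geom_point()                               # scatterplot
base + geom_point() + geom_smooth(method = "lm")  # same, with a trendline
```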

Inside geom_point we use the function aes to satisfy the “mapping” argument. aes stands for “aesthetic mapping”; aesthetic mappings “…describe how variables in the data are mapped to visual properties (aesthetics) of geoms” (from ?aes). xlab, ylab, and ggtitle should be fairly self-explanatory.

An alternate way of constructing the same plot is this:

# Chunk 46
ggplot(data = crop_ylds %>% filter(crop == "maize"), 
       mapping = aes(x = harv_area, y = prod)) +
  geom_point(col = "blue") + xlab("Harvested area (ha)") + 
  ylab("Production (tons)") + ggtitle("Maize production vs harvested area")

That chunk is not run, to save space, but its output is identical. In this case, we insert the tibble directly into ggplot’s “data” argument, and the aes is supplied to the “mapping” argument. We still need to call geom_point, but now we just tell it what point color we want (the “col” argument). The rest is the same.
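One aside: the grey gridded background seen in these plots comes from ggplot2’s default theme, and swapping in a complete theme such as theme_bw() removes it. A minimal sketch using the built-in mtcars data (so it stands alone):

```r
library(ggplot2)
# theme_bw() replaces the default grey panel with a white one;
# theme_minimal() is another commonly used option
p <- ggplot(mtcars) + geom_point(aes(x = wt, y = mpg), col = "blue") +
  theme_bw()
p
```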

So the above examples just show ggplot with prod versus harv across both countries. How would we distinguish between the countries, as we did in Chunks 43 and 44? By making just two small changes to the code we used in Chunk 45 (same would apply to Chunk 46)

# Chunk 47
crop_ylds %>% filter(crop == "maize") %>% 
  ggplot() + geom_point(aes(x = harv_area, y = prod, color = country)) +
  xlab("Harvested area (ha)") + ylab("Production (tons)") + 
  ggtitle("Maize production vs harvested area")

All we did here was supply the “color” argument inside aes, assigning to it the country variable. This means that geom_point (note we drop the argument name “mapping” now, which is fine) will assign a different color to each group of points based on their unique values in the country variable. It then very conveniently 1) creates a legend, 2) automatically scales the axes to cover the full data ranges, and, of course, 3) assigns a different color to each point grouping. The other change we made was to remove the “col” argument that was originally outside aes in Chunk 45, which assigned a single color to all points. That’s a key point to remember: “color” (or “colour”–you have probably already noticed that some tidyverse functions have variants for both UK and US English spellings) is supplied inside aes to map a variable to differentiated colors, while “col” is used outside aes to set a single color for all points.
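The inside-versus-outside distinction can be seen side by side in a minimal sketch (built-in mtcars data; its cyl column plays the role of our country variable):

```r
library(ggplot2)
# "color" inside aes(): mapped to a variable, so each cyl value gets its
# own color and a legend is drawn automatically
p_mapped <- ggplot(mtcars) +
  geom_point(aes(x = wt, y = mpg, color = factor(cyl)))
# "col" outside aes(): one fixed color for every point, no legend
p_fixed <- ggplot(mtcars) +
  geom_point(aes(x = wt, y = mpg), col = "blue")
```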

I don’t much care for ggplot’s default color palette, so I like to change it:

# Chunk 48
crop_ylds %>% filter(crop == "maize") %>% 
  ggplot() + geom_point(aes(x = harv_area, y = prod, color = country)) +
  scale_color_manual(values = c("red", "blue")) + 
  xlab("Harvested area (ha)") + ylab("Production (tons)") + 
  ggtitle("Maize production vs harvested area")

Which in this case is done using scale_color_manual and the vector of color names passed to “values”.

4.2.2 Line plots

Right, those are some very basic scatter plots we just looked at, and what they show is that maize production and harvested area have differing relationships in the two countries. Let’s explore this some more, and acquire some new plotting skills in the process.
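Before plotting, it is worth recalling how yield is derived: it is simply production divided by harvested area (tons/ha). A quick base R check with made-up numbers:

```r
# Made-up production (tons) and harvested area (ha) for three years
df <- data.frame(year = 2001:2003,
                 prod = c(900000, 1200000, 800000),
                 harv_area = c(1000000, 1000000, 950000))
df$yield <- df$prod / df$harv_area  # tons per hectare
df$yield
#> [1] 0.9000000 1.2000000 0.8421053
```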

Remembering that yield is production/harvested area (tons/ha), and that the data we are examining are a time series, we might get a better sense of the data by plotting yield over time:

# Chunk 49
crop_ylds %>% filter(crop == "maize") %>% 
  ggplot() + geom_point(aes(x = year, y = yield, color = country)) +
  scale_color_manual(values = c("red", "blue")) + 
  xlab("Year") + ylab("Yield (tons / ha)") + 
  ggtitle("Maize yield (1961-2024)")

Interesting. It seems that both countries show an increasing trend in yield since 1961, with a more rapid increase after 2000, particularly in Zambia. But the points are noisy, so maybe visualizing these as lines would make things clearer?

# Chunk 50
crop_ylds %>% filter(crop == "maize") %>% 
  ggplot() + geom_line(mapping = aes(x = year, y = yield, color = country)) +
  scale_color_manual(values = c("red", "blue")) + 
  xlab("Year") + ylab("Yield (tons / ha)") + 
  ggtitle("Maize yields (1961-2024)")

That does make things a bit clearer–and it suggests that Zambian yields diverged from Tanzanian yields in about the 2000s, a decade that also saw a huge drop in Tanzanian maize yields. The lines also better illustrate the between-year variations in yields, which seem to be larger in Zambia than in Tanzania. All we had to do to make the change was swap out geom_point for geom_line.

For comparison, here is how we would do that same plot with graphics methods:

# Chunk 51
crop_ylds %>% filter(crop == "maize" & country == "ZMB") %>% 
  plot(yield ~ year, data = ., type = "l", col = "blue", 
       ylim = range(crop_ylds %>% filter(crop == "maize") %>% pull(yield)), 
       main = "Maize yield (1961-2024)")
crop_ylds %>% filter(crop == "maize" & country == "TZA") %>% 
  lines(yield ~ year, data = ., col = "red")

Pretty much the same set-up as Chunk 44, using the type = "l" argument in plot to plot lines instead of points for the Zambian maize yields, and then the lines function (instead of points) to add the Tanzanian yield time series.

4.2.3 Points and lines

Those line comparisons still only give us a visual comparison. Fitting a trend line to the points would make things clearer. We can do that quite easily with ggplot:

# Chunk 52
crop_ylds %>% filter(crop == "maize") %>% 
  ggplot() + geom_point(aes(x = year, y = yield, color = country)) +
  geom_smooth(aes(x = year, y = yield, color = country)) + 
  scale_color_manual(values = c("red", "blue")) + 
  xlab("Year") + ylab("Yield (tons / ha)") + 
  ggtitle("Maize yields (1961-2024)")
#> `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

The above uses geom_smooth to add a best-fitting curve through each country’s yield time series. This function defaults to loess, which fits a local polynomial regression to the points (see ?loess for more details). It also provides a confidence interval around that fit. This now clearly shows the steeper yield increase in Zambia starting around 2000.
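If the default fit looks too smooth or too wiggly, geom_smooth’s loess can be tuned: the span argument controls how local the fit is (smaller values track the points more closely), and se = FALSE drops the confidence ribbon. A sketch with built-in data:

```r
library(ggplot2)
# A tighter span follows the points more closely; se = FALSE hides the band
p <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point() +
  geom_smooth(method = "loess", span = 0.5, se = FALSE)
```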

We can also just examine the linear (rather than curvilinear) trends:

# Chunk 53
crop_ylds %>% filter(crop == "maize") %>% 
  ggplot() + geom_point(aes(x = year, y = yield, color = country)) +
  geom_smooth(aes(x = year, y = yield, color = country), method = "lm") + 
  scale_color_manual(values = c("red", "blue")) + 
  xlab("Year") + ylab("Yield (tons / ha)") + 
  ggtitle("Maize yield (1961-2024)")
#> `geom_smooth()` using formula = 'y ~ x'

That gives us a straight-line fit through each time series; method = "lm" calls on the lm function to fit the trendlines, each showing the overall linear trend across the whole time series.

With graphics::plot, we have to fit the trendlines more explicitly:

# Chunk 54
zmb_maize <- crop_ylds %>% filter(crop == "maize" & country == "ZMB")
tza_maize <- crop_ylds %>% filter(crop == "maize" & country == "TZA")
yl <- range(crop_ylds[crop_ylds$crop == "maize", "yield"])
plot(yield ~ year, data = zmb_maize, pch = 16, col = "blue", 
     ylim = yl, main = "Maize yield (1961-2024)")
abline(lm(yield ~ year, data = zmb_maize), col = "blue")
points(yield ~ year, data = tza_maize, pch = 16, col = "red")
abline(lm(yield ~ year, data = tza_maize), col = "red")

We recreate the two subset tibbles zmb_maize and tza_maize, create a new variable yl containing the range of yields across both datasets, plot points for zmb_maize, and use lm to fit the regression between yield and year in zmb_maize, wrapping that in the abline function, which converts the regression model to a prediction line (i.e. the trendline). We then use points with tza_maize to add the scatter plot for Tanzanian maize, and fit another abline on the lm for that dataset. That’s more involved than the ggplot approach.
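What abline extracts from the model is just an intercept and a slope, which you can inspect directly with coef. A base R sketch with a fabricated yield trend (the numbers here are invented for illustration):

```r
# Fabricated yield series: roughly 0.05 t/ha gained per year, plus noise
set.seed(1)
toy <- data.frame(year = 1961:2020)
toy$yield <- 1 + 0.05 * (toy$year - 1961) + rnorm(60, sd = 0.2)

fit <- lm(yield ~ year, data = toy)
coef(fit)  # the intercept and slope that abline() draws

plot(yield ~ year, data = toy, pch = 16, col = "grey")
abline(fit, col = "blue")  # add the fitted trendline
```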

4.3 One-dimensional plots

We have been looking at plots using two variables so far. Now let’s look at some of the more commonly used visualizations for single variables. Let’s start with a histogram of sorghum yields, starting first with ggplot:

# Chunk 55
crop_ylds %>% filter(crop == "sorghum") %>%
  ggplot() + geom_histogram(aes(x = yield), fill = "blue", col = "black") +
  ggtitle("Distribution of sorghum yields")
#> `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

We grab sorghum out of crop_ylds, and then call geom_histogram with an aes mapping yield to the x variable. We specify the color for the histogram bar fill (“fill”) and the bar outline color (“col”) outside of aes. The message suggests that we pick a better bin width. Indeed, the bars are too narrow and there are data gaps. Let’s try again with a different number of bins:

# Chunk 56
crop_ylds %>% filter(crop == "sorghum") %>%
  ggplot() + 
  geom_histogram(aes(x = yield), bins = 15, fill = "blue", col = "black") +
  ggtitle("Distribution of sorghum yields")

We specify that we want 15 bins, rather than the default 30. Looks a bit better. Here’s how to do it using graphics::hist:

# Chunk 57
hist(x = crop_ylds$yield[crop_ylds$crop == "sorghum"], xlab = "yield", 
     main = "Distribution of sorghum yields", col = "blue")
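Back on the ggplot2 side, the earlier message also hinted at an alternative to a bin count: supplying binwidth, the width of each bar in the units of the x variable, which is often easier to reason about. A sketch with built-in data:

```r
library(ggplot2)
# binwidth is in the units of x (here, miles per gallon), so each bar
# spans 2 mpg regardless of the data range
p <- ggplot(mtcars) +
  geom_histogram(aes(x = mpg), binwidth = 2, fill = "blue", col = "black")
```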

We’ll keep focusing on ggplot though. How about a histogram breaking sorghum yields down by country?

# Chunk 58
crop_ylds %>% filter(crop == "sorghum") %>%
  ggplot() + 
  geom_histogram(aes(x = yield, color = country), bins = 15) +
  ggtitle("Distribution of sorghum yields")

As with scatter plots, specifying the “color” argument within aes separates the bars by country, making stacked histograms. What about changing the fill, rather than the bar outline color?

# Chunk 59
crop_ylds %>% filter(crop == "sorghum") %>%
  ggplot() + 
  geom_histogram(aes(x = yield, fill = country), bins = 15) +
  ggtitle("Distribution of sorghum yields") + 
  scale_fill_manual(values = c("red", "blue")) 

We specify fill = country in aes. Okay, so now it is clear that for polygonal objects, we distinguish between “fill” and “color” within aes (or “fill” and “col” outside of it). And if we want to change the fill colors from ggplot’s defaults, as we did with scatter plots above, we use the scale_fill_manual function.

We could also make this a side-by-side histogram, rather than a stacked one:

# Chunk 60
crop_ylds %>% filter(crop == "sorghum") %>%
  ggplot() + 
  geom_histogram(aes(x = yield, fill = country), bins = 15, 
                 position = "dodge", col = "black") +
  ggtitle("Distribution of sorghum yields") + 
  scale_fill_manual(values = c("red", "blue")) 

We achieve that by using the argument position = "dodge" (outside of aes). We also use col = "black" outside of aes to add a black outline around each bar.

So those are histograms. How about a simpler barplot for counting a categorical variable?

# Chunk 62
crop_ylds %>% filter(crop == "wheat") %>% 
  mutate(yld_levels = ifelse(yield < 3, 1, 2)) %>% 
  mutate(yld_levels = ifelse(yield > 6, 3, yld_levels)) %>% 
  ggplot() + geom_bar(aes(x = yld_levels), fill = "blue", col = "black") + 
  xlab("Yield levels")

Here we use geom_bar to make a bar plot from the categorical variable yld_levels, which we create using mutate and ifelse to group wheat yields into 3 categories (yield < 3 t/ha; 3-6 t/ha; > 6 t/ha). We make the bars blue, outlined in black.
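As an aside, the two ifelse calls could also be collapsed into a single base R cut(), which bins a numeric vector at specified breakpoints (a sketch with made-up yields; note the boundary handling at exactly 3 or 6 differs slightly from the ifelse version):

```r
# Made-up wheat yields (t/ha)
ylds <- c(1.2, 2.8, 3.5, 5.9, 6.1, 7.4)
# Breaks at 3 and 6 reproduce the three categories from the text
yld_levels <- cut(ylds, breaks = c(-Inf, 3, 6, Inf), labels = c(1, 2, 3))
table(yld_levels)
#> yld_levels
#> 1 2 3 
#> 2 2 2 
```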

Last plot we will show is the boxplot:

# Chunk 63
crop_ylds %>% ggplot() + geom_boxplot(aes(x = crop, y = yield)) + xlab(NULL)

Pretty straightforward. We get a separate boxplot for each crop by specifying crop as the “x” argument and yield as “y” in aes within geom_boxplot. Another way to show the distribution of values within a variable. We can also easily distinguish separate boxplots for each country:

# Chunk 64
crop_ylds %>% 
  ggplot() + geom_boxplot(aes(x = crop, y = yield, fill = country)) + 
  xlab(NULL) + scale_fill_manual(values = c("red", "blue")) 

We simply make country the variable on which to “fill” in aes, and then change the fill color to red and blue.

4.4 Multi-panel plots

Thinking back to our scatter plots, there is more to the story regarding the differing yield trends between Tanzania and Zambia, and it requires multiple plots to really tell it. Let’s have a look at yield, production, and harvested area trends over time to see what’s going on:

# Chunk 65
# #1
p1 <- crop_ylds %>% filter(crop == "maize") %>% 
  ggplot() + geom_point(aes(x = year, y = yield, color = country)) + 
  scale_color_manual(values = c("red", "blue")) + 
  ylab("Yield (tons/ha)") + xlab("") +
  geom_smooth(aes(x = year, y = yield, color = country))
p2 <- crop_ylds %>% filter(crop == "maize") %>% 
  ggplot() + geom_point(aes(x = year, y = prod, color = country)) + 
  scale_color_manual(values = c("red", "blue")) + 
  ylab("Production (tons)") + xlab("") +
  geom_smooth(aes(x = year, y = prod, color = country))
p3 <- crop_ylds %>% filter(crop == "maize") %>% 
  ggplot() + geom_point(aes(x = year, y = harv_area, color = country)) + 
  scale_color_manual(values = c("red", "blue")) + 
  ylab("Harvested area (ha)") + xlab("") +
  geom_smooth(aes(x = year, y = harv_area, color = country))
# #2
gp <- cowplot::plot_grid(p1 + theme(legend.position = "none"), 
                         p2 + theme(legend.position = "none"), 
                         p3 + theme(legend.position = "none"), nrow = 1,
                         align = "vh", axis = "l")
# #3
gp2 <- cowplot::plot_grid(gp, cowplot::get_legend(p1), rel_widths = c(3, 0.3))
# #4
# REPLACE THE filename WITH YOUR OWN!!!!
ggsave(gp2, filename = "materials/fig/u1m4-1.png", width = 9, height = 2.5, 
       units = "in", dpi = 300) 

This is a bit more advanced, but worth having a look at. In #1, we replicate our code from Chunk 52, using aes() with the “color” argument to separate the points by country. We create one ggplot object for each of the variables yield, prod, and harv_area. We assign each plot to its own object, which prevents it from printing to the graphics device.

In #2, we use the function plot_grid from the cowplot package to arrange the three plots into a single plot. We specify that the plots will fall onto one row (nrow = 1), and that it should align the three plots vertically and horizontally (align = "vh") against the left axis (axis = "l"). We also add theme(legend.position = "none") to each plot object, which tells that plot to render without its legend. That is done to prevent three identical legends from being created, one for each plot, which would take up valuable plot real estate.

In #3, we use cowplot::plot_grid again, putting the three-panel plot object as the first argument, and then use cowplot::get_legend(p1) to grab the legend from one of the individual plots, which makes it appear to the right of the three scatter plots. rel_widths = c(3, 0.3) specifies the relative width of each plot in the row, so the legend will take <10% of the total plot space (0.3 / (3 + 0.3)).

In #4 we use ggsave to write the whole plot to a png, which we then read back in to display within the vignette (see the Rmd code). The reason for that is to avoid making sdadata dependent on cowplot to build vignettes. So to run this code you will have to install.packages("cowplot").

So what is the story told by this plot? Both countries began with similar agricultural profiles in 1960, but their paths diverged significantly over time. Tanzania experienced a massive expansion in harvested area that fueled a corresponding surge in total production, whereas Zambia’s harvested area remained relatively stable and small-scale. Despite this difference in land expansion, Zambia achieved much higher yield gains, producing more tons per hectare than Tanzania by the 2020s. Ultimately, Tanzania’s production growth was driven primarily by land expansion with modest yield improvements, while Zambia focused on intensifying productivity on a much smaller land footprint.

We just showed a more complicated way of making a multi-panel plot. There is a simpler way to do this with ggplot2 functions, provided that the variables on your two axes don’t change between plot panels. We can illustrate this by showing the yield time series for each crop:

# Chunk 66
crop_ylds %>% 
  ggplot(.) + geom_point(aes(x = year, y = yield, color = country)) +
  geom_smooth(aes(x = year, y = yield, color = country)) + 
  facet_grid(cols = vars(crop)) +
  scale_color_manual(values = c("red", "blue")) + 
  ylab("Yield (tons/ha)") + xlab("")
#> `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

In this example, we use ggplot2::facet_grid to split the output into three plots, one per crop type, specifying that we want the values in vars(crop) to split into columns. For facet_grid, you have to wrap the column name in vars().
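A close relative of facet_grid is facet_wrap, which wraps panels into a grid and can give each panel its own axis range via scales = "free" (or "free_y"), useful when the panels’ value ranges differ widely. A sketch with built-in data, where mtcars’ cyl stands in for crop:

```r
library(ggplot2)
# One panel per cyl value, each with its own y-axis range
p <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point() +
  facet_wrap(vars(cyl), scales = "free_y")
```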

We can also specify that by rows:

# Chunk 67
crop_ylds %>% 
  ggplot() + geom_point(aes(x = year, y = yield, color = country)) +
  geom_smooth(aes(x = year, y = yield, color = country)) + 
  facet_grid(rows = vars(crop)) +
  scale_color_manual(values = c("red", "blue")) + 
  ylab("Yield (tons/ha)") + xlab("")
#> `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

So that’s it for our visualization section. There are many other graphics we could make, and we will learn some of these as we continue into the spatial unit of the course.

4.5 Practice

4.5.1 Questions

  1. What are some of the primary differences between plots made using the graphics package and those made with ggplot2?

  2. ggplot2 seems much prettier and maybe more convenient. So why are we learning about base graphics plots as well?

  3. How do the axis labels in the two examples given in Chunk 41 differ from the plot in Chunk 40? How could you change the names of the labels?

  4. You can use graphics::plot with dplyr pipelines. What argument and argument value do you have to use to make that work?

  5. How do you change the point color, point style, and point size in a graphics::plot?

  6. What happens when you run ggplot without a geom_* argument? You can run the code in Chunk 45 with the geom_point part removed to check.

4.5.2 Code

  1. Run the code in Chunk 56 without the “fill” and “col” arguments. Then run it, with just fill = "red" (no “col” argument).

  2. Using crop_ylds, split out the Tanzania wheat data (with dplyr) and use plot to make a time series between year and harv_area, using first points and then lines (hint: use the type = "l" argument for lines). Make the points solid blue and the lines blue. Give a title “Tanzania Wheat (1961-2024)”, and make the axis labels empty for the year axis and “Harvested area (ha)” for the y axis.

  3. Recreate the same two plots (labels, title, colors) using ggplot’s geom_point and then geom_line.

  4. Building on the previous, create a ggplot that separates the harv_area trends by country. Use geom_line to show the two trends. Change the default ggplot colors to blue for Zambia and red for Tanzania (i.e. use scale_color_manual). Extra: You will see that the Tanzania harv_area values are large relative to Zambia’s, so it is hard to see how they change over time. Wrap the log10() function around harv_area in aes, and replot for a more informative visualization.

  5. Plot just the Tanzania wheat time series, but use geom_point and geom_smooth to plot the points with a smooth trend through them. In other words, use the code you used for the first part of code practice 3 above, but add geom_smooth.

  6. Create a histogram from Zambia’s wheat planted area. Do it using both ggplot2 and graphics::hist methods. Add titles and informative x axis labels to both. Make the bars blue with black outlines.

  7. Let’s display wheat, sorghum, and maize harvested areas for Tanzania on side-by-side scatter plots, each with a smooth trendline fit through it. That requires adapting Chunk 66’s code. Hints: leave out color = country in geom_point and geom_smooth, and use dplyr to filter for Tanzanian data.

5 Unit assignment

5.1 Set-up

Make sure you are working in the main branch of your project (you should not be working on the a2 or a3 branch). Create a new vignette named “unit1_module4.Rmd”. By now, you should be fluent in switching branches and creating vignettes. If not, please revisit modules 1–3 to refresh your memory. You will use this to document the tasks you undertake for this assignment.

There are no package functions to create for this assignment. All work is to be done in the vignette.

5.2 Vignette tasks

  1. Create three tibbles, t1, t2, t3. t1 has 10 rows, with column V1 containing values G1 - G10, and V2 containing runif (randomly select numbers from a uniform distribution) between 75 and 125. t2 has 15 rows with v1 (G1 - G15) and v3 containing a random selection of LETTERS A-F. t3 has 20 rows (v1 with G1-G20), and v4 with numbers from the random normal distribution (rnorm, mean = 100, sd = 20). Use a seed of 1 for random numbers. Join t1, t2, and t3 within a single pipeline, using:
  • left_join
  • right_join
  • full_join
  • inner_join
  2. Recreate the crop_ylds dataset, using 1) an lapply to read in each .csv file from the sdadata package extdata/ folder, and 2) the dplyr steps necessary to *_join the data and make the necessary mutate-ations. Chunks 1, 11, 16, 19, and 21 are your guides.

A quick reminder of what crop_ylds should look like:

head(crop_ylds)
#> # A tibble: 6 × 6
#>   crop  country  year harv_area   prod yield
#>   <chr> <chr>   <dbl>     <dbl>  <dbl> <dbl>
#> 1 maize TZA      1961    790000 590000 0.747
#> 2 maize TZA      1962    800000 600000 0.75 
#> 3 maize TZA      1963    960000 850000 0.885
#> 4 maize TZA      1964    930000 720000 0.774
#> 5 maize TZA      1965    950000 751000 0.791
#> 6 maize TZA      1966   1100000 880000 0.8
  3. Use dplyr verbs to select the 5 top-ranked years for total harvested area for Tanzanian maize. Do the same for Tanzanian maize yields. To do this, you will need to use filter, arrange, and slice. The outputs for each test should be the 5 rows of crop_ylds that meet these criteria.

  4. Calculate the mean of each crop’s yield (across both countries) using split-apply-combine (SAC) based on dplyr, as well as an sapply using base R syntax within the sapply to subset on crop (note: subsetting a tibble is a bit different, so use this syntax to do the job within the sapply: mean(crop_ylds[crop_ylds$crop == x, ]$yield))

  5. Calculate a correlation matrix between harv_area and yield for each crop-country combination, using dplyr verbs. Arrange the result (negative to positive) by the value of the correlation coefficient. See Chunk 34 for guidance.

  6. Create a single scatter plot with ggplot that shows the relationship between harv_area (x-axis) and yield (y-axis) for maize, separated by country on a single plot. Make it a point scatter plot, with a straight trendline fit through each set of points (i.e. method = "lm" in geom_smooth). You will need to use geom_point and geom_smooth. Make a title (“Harvested area versus yield”) and x (“Harvested area (ha)”) and y (“Yield (t/ha)”) labels.

  7. Create a single scatter plot with graphics::plot that plots just Tanzanian wheat yields (y-axis) against year (x-axis). Plot the points, and then add a linear trendline to it, by wrapping the abline around the lm function. Make the points solid grey (“grey”) and the abline blue. Label the y axis as “Yield (t/ha)”. Remove the x-axis label. Give a title: “Tanzanian wheat (1961-2024)”. Chunk 54 is your guide.

  8. Use ggplot to make a 5-bin histogram of Zambia’s maize yields. The x-label should be “Yield (t/ha)”, the title should be “Zambian Maize”, and bins should be blue with black outlines.

5.3 Assignment output

Refer to Unit 1 Module 3’s Assignment output section for submission instructions. The differences are as follows:

  1. You should increment your package version number by 1 on the lowest number (e.g. from 0.1.1 to 0.1.2) in the DESCRIPTION;

  2. Control the width of your plots in 6-8 using the following chunk options:

    ```{r, fig.width=4.75, fig.height=3.75, fig.align = "center"}
    ```
  3. Revise the setting chunk in the beginning of your vignette to:

    knitr::opts_chunk$set(
      collapse = TRUE,
      comment = "#>",
      warning = FALSE,
      message = FALSE
    )

    to avoid unnecessary warning and message.

  4. Your submission should be on a new side branch “a3”.


Back to home